Prologue



This book is devoted to the theory of probabilistic information measures and 
their application to coding theorems for information sources and noisy chan- 
nels. The eventual goal is a general development of Shannon’s mathematical 
theory of communication, but much of the space is devoted to the tools and 
methods required to prove the Shannon coding theorems. These tools form an 
area common to ergodic theory and information theory and comprise several 
quantitative notions of the information in random variables, random processes, 
and dynamical systems. Examples are entropy, mutual information, conditional 
entropy, conditional information, and discrimination or relative entropy, along 
with the limiting normalized versions of these quantities such as entropy rate 
and information rate. Much of the book is concerned with their properties, es- 
pecially the long term asymptotic behavior of sample information and expected 
information. 

The book has been strongly influenced by M. S. Pinsker’s classic Information 
and Information Stability of Random Variables and Processes and by the seminal 
work of A. N. Kolmogorov, I. M. Gelfand, A. M. Yaglom, and R. L. Dobrushin on
information measures for abstract alphabets and their convergence properties. 
Many of the results herein are extensions of their generalizations of Shannon’s 
original results. The mathematical models of this treatment are more general 
than traditional treatments in that nonstationary and nonergodic information 
processes are treated. The models are somewhat less general than those of the 
Soviet school of information theory in the sense that standard alphabets rather 
than completely abstract alphabets are considered. This restriction, however, 
permits many stronger results as well as the extension to nonergodic processes. 
In addition, the assumption of standard spaces simplifies many proofs, and such
spaces include virtually all examples of engineering interest.

The information convergence results are combined with ergodic theorems 
to prove general Shannon coding theorems for sources and channels. The re- 
sults are not the most general known and the converses are not the strongest 
available, but they are sufficiently general to cover most systems encountered
in applications and they provide an introduction to recent extensions requiring 
significant additional mathematical machinery. Several of the generalizations 
have not previously been treated in book form. Examples of novel topics for an 
information theory text include asymptotic mean stationary sources, one-sided 
sources as well as two-sided sources, nonergodic sources, d-continuous channels, 
and sliding block codes. Another novel aspect is the use of recent proofs of
general Shannon-McMillan-Breiman theorems which do not use martingale the- 
ory: A coding proof of Ornstein and Weiss [117] is used to prove the almost 
everywhere convergence of sample entropy for discrete alphabet processes and 
a variation on the sandwich approach of Algoet and Cover [7] is used to prove 
the convergence of relative entropy densities for general standard alphabet pro- 
cesses. Both results are proved for asymptotically mean stationary processes 
which need not be ergodic. 

This material can be considered as a sequel to my book Probability, Random 
Processes, and Ergodic Properties [51] wherein the prerequisite results on prob- 
ability, standard spaces, and ordinary ergodic properties may be found. This 
book is self-contained with the exception of common (and a few less common)
results which may be found in the first book. 

It is my hope that the book will interest engineers in some of the mathemat- 
ical aspects and general models of the theory and mathematicians in some of 
the important engineering applications of performance bounds and code design 
for communication systems. 

Information theory or the mathematical theory of communication has two 
primary goals: The first is the development of the fundamental theoretical lim- 
its on the achievable performance when communicating a given information 
source over a given communications channel using coding schemes from within 
a prescribed class. The second goal is the development of coding schemes that 
provide performance that is reasonably good in comparison with the optimal 
performance given by the theory. Information theory was born in a surpris- 
ingly rich state in the classic papers of Claude E. Shannon [129] [130] which 
contained the basic results for simple memoryless sources and channels and in- 
troduced more general communication systems models, including finite state 
sources and channels. The key tools used to prove the original results and many 
of those that followed were special cases of the ergodic theorem and a new vari- 
ation of the ergodic theorem which considered sample averages of a measure of 
the entropy or self information in a process. 

Information theory can be viewed as simply a branch of applied probability 
theory. Because of its dependence on ergodic theorems, however, it can also be 
viewed as a branch of ergodic theory, the theory of invariant transformations 
and transformations related to invariant transformations. In order to develop 
the ergodic theory example of principal interest to information theory, suppose 
that one has a random process, which for the moment we consider as a sam- 
ple space or ensemble of possible output sequences together with a probability 
measure on events composed of collections of such sequences. The shift is the 
transformation on this space of sequences that takes a sequence and produces a 
new sequence by shifting the first sequence a single time unit to the left. In other 
words, the shift transformation is a mathematical model for the effect of time 
on a data sequence. If the probability of any sequence event is unchanged by 
shifting the event, that is, by shifting all of the sequences in the event, then the 
shift transformation is said to be invariant and the random process is said to be 
stationary. Thus the theory of stationary random processes can be considered as
a subset of ergodic theory. Transformations that are not actually invariant (ran- 
dom processes which are not actually stationary) can be considered using similar 
techniques by studying transformations which are almost invariant, which are 
invariant in an asymptotic sense, or which are dominated or asymptotically 
dominated in some sense by an invariant transformation. This generality can 
be important as many real processes are not well modeled as being stationary. 
Examples are processes with transients, processes that have been parsed into 
blocks and coded, processes that have been encoded using variable-length codes 
or finite state codes, and channels with arbitrary starting states.

Ergodic theory was originally developed for the study of statistical mechanics 
as a means of quantifying the trajectories of physical or dynamical systems. 
Hence, in the language of random processes, the early focus was on ergodic 
theorems: theorems relating the time or sample average behavior of a random 
process to its ensemble or expected behavior. The work of Hopf [65], von
Neumann [146] and others culminated in the pointwise or almost everywhere
ergodic theorem of Birkhoff [16].

In the 1940’s and 1950’s Shannon made use of the ergodic theorem in the 
simple special case of memoryless processes to characterize the optimal perfor- 
mance theoretically achievable when communicating information sources over 
constrained random media called channels. The ergodic theorem was applied 
in a direct fashion to study the asymptotic behavior of error frequency and 
time average distortion in a communication system, but a new variation was 
introduced by defining a mathematical measure of the entropy or information 
in a random process and characterizing its asymptotic behavior. These results 
are known as coding theorems. Results describing performance that is actually 
achievable, at least in the limit of unbounded complexity and time, are known as 
positive coding theorems. Results providing unbeatable bounds on performance 
are known as converse coding theorems or negative coding theorems. When the 
same quantity is given by both positive and negative coding theorems, one has 
exactly the optimal performance theoretically achievable by the given commu- 
nication systems model. 

While mathematical notions of information had existed before, it was Shan- 
non who coupled the notion with the ergodic theorem and an ingenious idea 
known as “random coding” in order to develop the coding theorems and to 
thereby give operational significance to such information measures. The name 
“random coding” is a bit misleading since it refers to the random selection of 
a deterministic code and not a coding system that operates in a random or 
stochastic manner. The basic approach to proving positive coding theorems 
was to analyze the average performance over a random selection of codes. If 
the average is good, then there must be at least one code in the ensemble of 
codes with performance as good as the average. The ergodic theorem is cru- 
cial to this argument for determining such average behavior. Unfortunately, 
such proofs promise the existence of good codes but give little insight into their 
construction. 
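
The averaging argument above can be illustrated numerically. The following is only a minimal sketch under assumptions that are not part of the text: a binary symmetric channel with crossover probability eps, very short codes, and minimum-distance decoding. The names error_probability and random_code are hypothetical. The sketch draws deterministic codes at random, computes each code's error probability exactly, and checks that the best code drawn is at least as good as the ensemble average. It does not show the role of the ergodic theorem, which enters only in the asymptotic analysis.

```python
import itertools
import random

def hamming(a, b):
    """Number of positions in which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def error_probability(code, eps):
    """Exact average error probability of minimum-distance decoding
    when codewords are sent over a binary symmetric channel (assumed model)."""
    n = len(code[0])
    total = 0.0
    for sent_index, sent in enumerate(code):
        for received in itertools.product((0, 1), repeat=n):
            d = hamming(sent, received)
            prob = (eps ** d) * ((1 - eps) ** (n - d))
            # Decode to the nearest codeword; ties go to the lowest index.
            decoded = min(range(len(code)),
                          key=lambda j: hamming(code[j], received))
            if decoded != sent_index:
                total += prob
    return total / len(code)

def random_code(rng, num_words, n):
    """Random selection of a deterministic code: num_words binary n-tuples."""
    return [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(num_words)]

rng = random.Random(0)  # fixed seed so the run is repeatable
eps, n, num_words, num_codes = 0.05, 7, 4, 50
errors = [error_probability(random_code(rng, num_words, n), eps)
          for _ in range(num_codes)]
average_error = sum(errors) / len(errors)
best_error = min(errors)
print(f"ensemble average: {average_error:.4f}, best code: {best_error:.4f}")
```

The comparison of best_error with average_error is the entire existence argument: the minimum of a finite collection can never exceed its mean. As the text notes, this says nothing about how to construct a good code other than by search.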

Shannon’s original work focused on memoryless sources whose probability 
distribution did not change with time and whose outputs were drawn from a finite
alphabet or the real line. In this simple case the well-known ergodic theorem
immediately provided the required result concerning the asymptotic behavior of 
information. He observed that the basic ideas extended in a relatively straight- 
forward manner to more complicated Markov sources. Even this generalization, 
however, was a far cry from the general stationary sources considered in the 
ergodic theorem. 

To continue the story requires a few additional words about measures of 
information. Shannon really made use of two different but related measures. 
The first was entropy, an idea inherited from thermodynamics and previously 
proposed as a measure of the information in a random signal by Hartley [64]. 
Shannon defined the entropy of a discrete time discrete alphabet random process
{X_n}, which we denote by H(X) while deferring its definition, and made
rigorous the idea that the entropy of a process is the amount of information
in the process. He did this by proving a coding theorem showing that
if one wishes to code the given process into a sequence of binary symbols so 
that a receiver viewing the binary sequence can reconstruct the original process 
perfectly (or nearly so), then one needs at least H(X) binary symbols or bits 
(converse theorem) and one can accomplish the task with very close to H(X) 
bits (positive theorem). This coding theorem is known as the noiseless source 
coding theorem. 

The second notion of information used by Shannon was mutual information. 
Entropy is really a notion of self information: the information provided by a
random process about itself. Mutual information is a measure of the information 
contained in one process about another process. While entropy is sufficient to 
study the reproduction of a single process through a noiseless environment, more 
often one has two or more distinct random processes, e.g., one random process 
representing an information source and another representing the output of a 
communication medium wherein the coded source has been corrupted by another 
random process called noise. In such cases observations are made on one process 
in order to make decisions on another. Suppose that {X_n, Y_n} is a random
process with a discrete alphabet, that is, taking on values in a discrete set. The
coordinate random processes {X_n} and {Y_n} might correspond, for example,
to the input and output of a communication system. Shannon introduced the 
notion of the average mutual information between the two processes: 

I(X,Y) = H(X) + H(Y) - H(X,Y), (1)

the sum of the two self entropies minus the entropy of the pair. This proved to 
be the relevant quantity in coding theorems involving more than one distinct 
random process: the channel coding theorem describing reliable communication 
through a noisy channel, and the general source coding theorem describing the 
coding of a source for a user subject to a fidelity criterion. The first theorem 
focuses on error detection and correction and the second on analog-to-digital 
conversion and data compression. Special cases of both of these coding theorems 
were given in Shannon’s original work. 
Average mutual information can also be defined in terms of conditional entropy
(or equivocation) H(X|Y) = H(X,Y) - H(Y) and hence

I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). (2)

In this form the mutual information can be interpreted as the information con- 
tained in one process minus the information contained in the process when the 
other process is known. While elementary texts on information theory abound 
with such intuitive descriptions of information measures, we will minimize such 
discussion because of the potential pitfall of using the interpretations to apply 
such measures to problems where they are not appropriate. (See, e.g., P. Elias’
“Information theory, photosynthesis, and religion” in his “Two famous papers” 
[36].) Information measures are important because coding theorems exist im- 
buing them with operational significance and not because of intuitively pleasing 
aspects of their definitions. 

We focus on the definition (1) of mutual information since it does not require 
any explanation of what conditional entropy means and since it has a more 
symmetric form than the conditional definitions. It turns out that H(X,X) = 
H(X) (the entropy of a random variable is not changed by repeating it) and
hence from (1)

I(X,X) = H(X) (3)

so that entropy can be considered as a special case of average mutual informa- 
tion. 
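
The identities (1) through (3) can be checked numerically in the simplest setting. The following is a minimal sketch that assumes single random variables with finite alphabets rather than the random processes of the text, and measures entropy in bits. The joint distribution (a fair bit observed through a binary symmetric channel with crossover probability 0.1) and the function names are hypothetical choices made only for illustration.

```python
import math

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    """Marginal pmf of coordinate `axis` (0 or 1) of a joint pmf {(x, y): p}."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def mutual_information(joint):
    """Average mutual information via (1): H(X) + H(Y) - H(X,Y)."""
    return (entropy(marginal(joint, 0)) + entropy(marginal(joint, 1))
            - entropy(joint))

# Assumed example: fair bit X, and Y equal to X flipped with probability eps.
eps = 0.1
joint = {(0, 0): 0.5 * (1 - eps), (0, 1): 0.5 * eps,
         (1, 0): 0.5 * eps, (1, 1): 0.5 * (1 - eps)}

h_x = entropy(marginal(joint, 0))
h_y = entropy(marginal(joint, 1))
h_xy = entropy(joint)
i_xy = mutual_information(joint)

# Conditional entropies, defined as in the text: H(X|Y) = H(X,Y) - H(Y).
h_x_given_y = h_xy - h_y
h_y_given_x = h_xy - h_x

# Repeating a variable: the pair (X, X) puts all mass on the diagonal.
joint_xx = {(x, x): p for x, p in marginal(joint, 0).items()}
i_xx = mutual_information(joint_xx)

print(f"I(X,Y) = {i_xy:.4f} bits, H(X) = {h_x:.4f} bits, I(X,X) = {i_xx:.4f} bits")
```

Here i_xy agrees with both h_x - h_x_given_y and h_y - h_y_given_x, which is the conditional form (2), and i_xx agrees with h_x, which is (3).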

To return to the story, Shannon’s work spawned the new field of information 
theory and also had a profound effect on the older field of ergodic theory. 

Information theorists, both mathematicians and engineers, extended Shan- 
non’s basic approach to ever more general models of information sources, coding 
structures, and performance measures. The fundamental ergodic theorem for 
entropy was extended to the same generality as the ordinary ergodic theorems by 
McMillan [103] and Breiman [19] and the result is now known as the Shannon- 
McMillan-Breiman theorem. (Other names are the asymptotic equipartition 
theorem or AEP, the ergodic theorem of information theory, and the entropy 
theorem.) A variety of detailed proofs of the basic coding theorems and stronger 
versions of the theorems for memoryless, Markov, and other special cases of ran- 
dom processes were developed, notable examples being the work of Feinstein [38] 
[39] and Wolfowitz (see, e.g., Wolfowitz [151]). The ideas of measures of information,
channels, codes, and communications systems were rigorously extended
to more general random processes with abstract alphabets and discrete and 
continuous time by Khinchine [72], [73] and by Kolmogorov and his colleagues, 
especially Gelfand, Yaglom, Dobrushin, and Pinsker [45], [90], [87], [32], [125]. 
(See, for example, “Kolmogorov’s contributions to information theory and algo- 
rithmic complexity” [23].) In almost all of the early Soviet work, it was average 
mutual information that played the fundamental role. It was the more natural 
quantity when more than one process was being considered. In addition, the
notion of entropy was not useful when dealing with processes with continuous 
alphabets since it is virtually always infinite in such cases. A generalization of 
the idea of entropy called discrimination was developed by Kullback (see, e.g.,
Kullback [92]) and was further studied by the Soviet school. This form of infor- 
mation measure is now more commonly referred to as relative entropy or cross 
entropy (or Kullback-Leibler number) and it is better interpreted as a measure 
of similarity between probability distributions than as a measure of information 
between random variables. Many results for mutual information and entropy 
can be viewed as special cases of results for relative entropy and the formula for 
relative entropy arises naturally in some proofs. 

It is the mathematical aspects of information theory and hence the descen- 
dants of the above results that are the focus of this book, but the developments 
in the engineering community have had as significant an impact on the founda- 
tions of information theory as they have had on applications. Simpler proofs of 
the basic coding theorems were developed for special cases and, as a natural
offshoot, the rate of convergence to the optimal performance bounds was characterized
in a variety of important cases. See, e.g., the texts by Gallager [43], Berger [11],
and Csiszár and Körner [26]. Numerous practicable coding techniques were
developed which provided performance reasonably close to the optimum in many
cases: from the simple linear error correcting and detecting codes of Slepian 
[137] to the huge variety of algebraic codes currently being implemented (see, 
e.g., [13], [148], [95], [97], [18]) and the various forms of convolutional, tree, and 
trellis codes for error correction and data compression (see, e.g., [145], [69]). 
Clustering techniques have been used to develop good nonlinear codes (called 
“vector quantizers”) for data compression applications such as speech and image
coding [49], [46], [99], [69], [118]. These clustering and trellis search techniques 
have been combined to form single codes that combine the data compression 
and reliable communication operations into a single coding system [8].

The engineering side of information theory through the middle 1970’s has 
been well chronicled by two IEEE collections: Key Papers in the Development 
of Information Theory, edited by D. Slepian [138], and Key Papers in the Development
of Coding Theory, edited by E. Berlekamp [14]. In addition there have
been several survey papers describing the history of information theory during 
each decade of its existence published in the IEEE Transactions on Information 
Theory. 

The influence on ergodic theory of Shannon’s work was equally great but in 
a different direction. After the development of quite general ergodic theorems, 
one of the principal issues of ergodic theory was the isomorphism problem, the 
characterization of conditions under which two dynamical systems are really the 
same in the sense that each could be obtained from the other in an invertible 
way by coding. Here, however, the coding was not of the variety considered 
by Shannon: Shannon considered block codes, codes that parsed the data into 
nonoverlapping blocks or windows of finite length and separately mapped each 
input block into an output block. The more natural construct in ergodic theory 
can be called a sliding block code: Here the encoder views a block of possibly 
infinite length and produces a single symbol of the output sequence using some 
mapping (or code or filter). The input sequence is then shifted one time unit to 
the left, and the same mapping applied to produce the next output symbol, and 
so on. This is a smoother operation than the block coding structure since the
outputs are produced based on overlapping windows of data instead of on a com- 
pletely different set of data each time. Unlike the Shannon codes, these codes 
will produce stationary output processes if given stationary input processes. It 
should be mentioned that examples of such sliding block codes often occurred 
in the information theory literature: time-invariant convolutional codes or, sim- 
ply, time-invariant linear filters are sliding block codes. It is perhaps odd that 
virtually all of the theory for such codes in the information theory literature 
was developed by effectively considering the sliding block codes as very long 
block codes. Recently sliding block codes have proved a useful structure for the 
design of noiseless codes for constrained alphabet channels such as magnetic 
recording devices, and techniques from symbolic dynamics have been applied to 
the design of such codes. See, for example, [3], [100].

Shannon’s noiseless source coding theorem suggested a solution to the iso- 
morphism problem: If we assume for the moment that one of the two processes 
is binary, then perfect coding of a process into a binary process and back into 
the original process requires that the original process and the binary process 
have the same entropy. Thus a natural conjecture is that two processes are iso- 
morphic if and only if they have the same entropy. A major difficulty was the 
fact that two different kinds of coding were being considered: stationary sliding 
block codes with zero error by the ergodic theorists and either fixed length block 
codes with small error or variable length (and hence nonstationary) block codes 
with zero error by the Shannon theorists. While it was plausible that the former 
codes might be developed as some sort of limit of the latter, this proved to be 
an extremely difficult problem. It was Kolmogorov [88], [89] who first reasoned 
along these lines and proved that in fact equal entropy (appropriately defined) 
was a necessary condition for isomorphism. 

Kolmogorov’s seminal work initiated a new branch of ergodic theory devoted 
to the study of entropy of dynamical systems and its application to the isomor- 
phism problem. Most of the original work was done by Soviet mathematicians; 
notable papers are those by Sinai [134] [135] (in ergodic theory entropy is also 
known as the Kolmogorov-Sinai invariant), Pinsker [125], and Rohlin and Sinai 
[127]. An actual construction of a perfectly noiseless sliding block code for a 
special case was provided by Meshalkin [104]. While much insight was gained 
into the behavior of entropy and progress was made on several simplified ver- 
sions of the isomorphism problem, it was several years before Ornstein [114] 
proved a result that has since come to be known as the Kolmogorov-Ornstein 
isomorphism theorem. 

Ornstein showed that if one focused on a class of random processes which 
we shall call B-processes, then two processes are indeed isomorphic if and only 
if they have the same entropy. B-processes have several equivalent definitions;
perhaps the simplest is that they are processes which can be obtained by encoding
a memoryless process using a sliding block code. This class remains the most
general class known for which the isomorphism conjecture holds. In the course 
of his proof, Ornstein developed intricate connections between block coding and 
sliding block coding. He used Shannonlike techniques on the block codes, then
imbedded the block codes into sliding block codes, and then used the stationary
structure of the sliding block codes to advantage in limiting arguments to obtain 
the required zero error codes. Several other useful techniques and results were 
introduced in the proof: notions of the distance between processes and relations 
between the goodness of approximation and the difference of entropy. Ornstein 
expanded these results into a book [116] and gave a tutorial discussion in the 
premier issue of the Annals of Probability [115]. Several correspondence items 
by other ergodic theorists discussing the paper accompanied the article. 

The origins of this book lie in the tools developed by Ornstein for the proof 
of the isomorphism theorem rather than with the result itself. During the early 
1970’s I first became interested in ergodic theory because of joint work with Lee
D. Davisson on source coding theorems for stationary nonergodic processes. The 
ergodic decomposition theorem discussed in Ornstein [115] provided a needed 
missing link and led to an intense campaign on my part to learn the funda- 
mentals of ergodic theory and perhaps find other useful tools. This effort was 
greatly eased by Paul Shields’ book The Theory of Bernoulli Shifts [131] and by 
discussions with Paul on topics in both ergodic theory and information theory. 
This in turn led to a variety of other applications of ergodic theoretic techniques 
and results to information theory, mostly in the area of source coding theory: 
proving source coding theorems for sliding block codes and using process dis- 
tance measures to prove universal source coding theorems and to provide new 
characterizations of Shannon distortion-rate functions. The work was done with 
Dave Neuhoff, like me then an apprentice ergodic theorist, and Paul Shields. 

With the departure of Dave and Paul from Stanford, my increasing inter- 
est led me to discussions with Don Ornstein on possible applications of his 
techniques to channel coding problems. The interchange often consisted of my 
describing a problem, his generation of possible avenues of solution, and then 
my going off to work for a few weeks to understand his suggestions and work 
them through. 

One problem resisted our best efforts: how to synchronize block codes over
channels with memory, a prerequisite for constructing sliding block codes for 
such channels. In 1975 I had the good fortune to meet and talk with Roland
Dobrushin at the IEEE/USSR Workshop on Information Theory in Moscow.
He observed that some of his techniques for handling synchronization in memo- 
ryless channels should immediately generalize to our case and therefore should 
provide the missing link. The key elements were all there, but it took seven 
years for the paper by Ornstein, Dobrushin and me to evolve and appear [59]. 

Early in the course of the channel coding paper, I decided that having the 
solution to the sliding block channel coding result in sight was sufficient excuse 
to write a book on the overlap of ergodic theory and information theory. The 
intent was to develop the tools of ergodic theory of potential use to information 
theory and to demonstrate their use by proving Shannon coding theorems for 
the most general known information sources, channels, and code structures. 
Progress on the book was disappointingly slow, however, for a number of reasons. 
As delays mounted, I saw many of the general coding theorems extended and 
improved by others (often by J. C. Kieffer) and new applications of ergodic 
theory to information theory developed, such as the channel modeling work
of Neuhoff and Shields [110], [113], [112], [111] and design methods for sliding 
block codes for input restricted noiseless channels by Adler, Coppersmith, and 
Hassner [3] and Marcus [100]. Although I continued to work in some aspects of
the area, especially with nonstationary and nonergodic processes and processes 
with standard alphabets, the area remained for me a relatively minor one and 
I had little time to write. Work and writing came in bursts during sabbaticals 
and occasional advanced topic seminars. I abandoned the idea of providing the 
most general possible coding theorems and decided instead to settle for coding 
theorems that were sufficiently general to cover most applications and which 
possessed proofs I liked and could understand. The mantle of the most general 
theorems will go to a book in progress by J.C. Kieffer [85]. That book shares 
many topics with this one, but the approaches and viewpoints and many of the 
results treated are quite different. At the risk of generalizing, the books will 
reflect our differing backgrounds: mine as an engineer by training and a would- 
be mathematician, and his as a mathematician by training migrating to an 
engineering school. The proofs of the principal results often differ in significant 
ways and the two books contain a variety of different minor results developed 
as tools along the way. This book is perhaps more “old fashioned” in that 
the proofs often retain the spirit of the original “classical” proofs, while Kieffer 
has developed a variety of new and powerful techniques to obtain the most 
general known results. I have also taken more detours along the way in order 
to catalog various properties of entropy and other information measures that I 
found interesting in their own right, even though they were not always necessary 
for proving the coding theorems. Only one third of this book is actually devoted 
to Shannon source and channel coding theorems; the remainder can be viewed 
as a monograph on information measures and their properties, especially their 
ergodic properties. 

Because of delays in the original project, the book was split into two smaller 
books and the first, Probability, Random Processes, and Ergodic Properties, 
was published by Springer-Verlag in 1988 [50]. It treats advanced probability
and random processes with an emphasis on processes with standard alphabets, 
on nonergodic and nonstationary processes, and on necessary and sufficient 
conditions for the convergence of long term sample averages. Asymptotically 
mean stationary sources and the ergodic decomposition are there treated in 
depth and recent simplified proofs of the ergodic theorem due to Ornstein and 
Weiss [117] and others were incorporated. That book provides the background 
material and introduction to this book, the split naturally falling before the 
introduction of entropy. The first chapter of this book reviews some of the basic 
notation of the first one in information theoretic terms, but results are often 
simply quoted as needed from the first book without any attempt to derive 
them. The two books together are self-contained in that all supporting results 
from probability theory and ergodic theory needed here may be found in the 
first book. This book is self-contained so far as its information theory content, 
but it should be considered as an advanced text on the subject and not as an 
introductory treatise to the reader only wishing an intuitive overview.

Here the Shannon-McMillan-Breiman theorem is proved using the coding
approach of Ornstein and Weiss [117] (see also Shields’ tutorial paper [132])
and hence the treatments of ordinary ergodic theorems in the first book and the 
ergodic theorems for information measures in this book are consistent. The ex- 
tension of the Shannon-McMillan-Breiman theorem to densities is proved using 
the “sandwich” approach of Algoet and Cover [7], which depends strongly on 
the usual pointwise or Birkhoff ergodic theorem: sample entropy is asymptotically
sandwiched between two functions whose limits can be determined from
the ergodic theorem. These results are the most general yet published in book 
form and differ from traditional developments in that martingale theory is not 
required in the proofs. 

A few words are in order regarding topics that are not contained in this 
book. I have not included multiuser information theory for two reasons: First, 
after including the material that I wanted most, there was no room left. Second, 
my experience in the area is slight and I believe this topic can be better handled 
by others. Results as general as the single user systems described here have not 
yet been developed. Good surveys of the multiuser area may be found in El 
Gamal and Cover [44], van der Meulen [142], and Berger [12]. 

Traditional noiseless coding theorems and actual codes such as the Huff- 
man codes are not considered in depth because quite good treatments exist in 
the literature, e.g., [43], [1], [102]. The corresponding ergodic theory result, 
the Kolmogorov-Ornstein isomorphism theorem, is also not proved, because its 
proof is difficult and the result is not needed for the Shannon coding theorems. 
Many techniques used in its proof, however, are used here for similar and other 
purposes. 

The actual computation of channel capacity and distortion rate functions 
has not been included because existing treatments [43], [17], [11], [52] are quite 
adequate. 

This book does not treat code design techniques. Algebraic coding is well 
developed in existing texts on the subject [13], [148], [95], [18]. Allen Gersho and 
I are currently writing a book on the theory and design of nonlinear coding tech- 
niques such as vector quantizers and trellis codes for analog-to-digital conversion 
and for source coding (data compression) and combined source and channel cod- 
ing applications [47]. A less mathematical treatment of rate-distortion theory 
along with other source coding topics not treated here (including asymptotic, 
or high rate, quantization theory and uniform quantizer noise theory) may be 
found in my book [52]. 

Universal codes, codes which work well for an unknown source, and variable 
rate codes, codes producing a variable number of bits for each input vector, are 
not considered. The interested reader is referred to [109] [96] [77] [78] [28] and 
the references therein. 

A recent active research area that has made good use of the ideas of rel- 
ative entropy to characterize exponential growth is that of large deviations 
theory [143], [31]. These techniques have been used to provide new proofs of the 
basic source coding theorems [22]. These topics are not treated here. 

Lastly, J. C. Kieffer has recently developed a powerful new ergodic theorem 
that can be used to prove both traditional ergodic theorems and the extended 
Shannon-McMillan-Breiman theorem [83]. He has used this theorem to prove 
new strong (almost everywhere) versions of the source coding theorem and its 
converse, that is, results showing that sample average distortion is with proba- 
bility one no smaller than the distortion-rate function and that there exist codes 
with sample average distortion arbitrarily close to the distortion-rate function 
[84] [82]. These results should have a profound impact on the future development 
of the theoretical tools and results of information theory. Their imminent 
publication provides a strong motivation for the completion of this monograph, 
which is devoted to the traditional methods. Tradition has its place, however, 
and the methods and results treated here should retain much of their role at the 
core of the theory of entropy and information. It is hoped that this collection 
of topics and methods will find a niche in the literature. 

19 November 2000 Revision: The original edition went out of print in 
2000. Hence I took the opportunity to fix more typos which have been brought 
to my attention (thanks in particular to Yariv Ephraim) and to prepare the book 
for Web posting. This is done with the permission of the original publisher and 
copyright-holder, Springer-Verlag. I hope someday to do some more serious 
revising, but for the moment I am content to fix the known errors and make the 
manuscript available. 



Chapter 1 



Information Sources 



1.1 Introduction 

An information source or source is a mathematical model for a physical entity 
that produces a succession of symbols called “outputs” in a random manner. 
The symbols produced may be real numbers such as voltage measurements from 
a transducer, binary numbers as in computer data, two dimensional intensity 
fields as in a sequence of images, continuous or discontinuous waveforms, and 
so on. The space containing all of the possible output symbols is called the 
alphabet of the source and a source is essentially an assignment of a probability 
measure to events consisting of sets of sequences of symbols from the alphabet. 
It is useful, however, to explicitly treat the notion of time as a transformation 
of sequences produced by the source. Thus in addition to the common random 
process model we shall also consider modeling sources by dynamical systems as 
considered in ergodic theory. 

The material in this chapter is a distillation of [50] and is intended to estab- 
lish notation. 



1.2 Probability Spaces and Random Variables 

A measurable space $(\Omega, \mathcal{B})$ is a pair consisting of a sample space $\Omega$ together with 
a $\sigma$-field $\mathcal{B}$ of subsets of $\Omega$ (also called the event space). A $\sigma$-field or $\sigma$-algebra 
$\mathcal{B}$ is a nonempty collection of subsets of $\Omega$ with the following properties: 

$$\Omega \in \mathcal{B}. \tag{1.1}$$

$$\text{If } F \in \mathcal{B}, \text{ then } F^c = \{\omega : \omega \notin F\} \in \mathcal{B}. \tag{1.2}$$

$$\text{If } F_i \in \mathcal{B};\ i = 1, 2, \cdots, \text{ then } \bigcup_i F_i \in \mathcal{B}. \tag{1.3}$$

From de Morgan’s “laws” of elementary set theory it follows that also 

$$\bigcap_{i=1}^{\infty} F_i = \Big( \bigcup_{i=1}^{\infty} F_i^c \Big)^c \in \mathcal{B}.$$

An event space is a collection of subsets of a sample space (called events by 
virtue of belonging to the event space) such that any countable sequence of set 
theoretic operations (union, intersection, complementation) on events produces 
other events. Note that there are two extremes: the largest possible $\sigma$-field of 
$\Omega$ is the collection of all subsets of $\Omega$ (sometimes called the power set) and the 
smallest possible $\sigma$-field is $\{\Omega, \emptyset\}$, the entire space together with the null set 
$\emptyset = \Omega^c$ (called the trivial space). 

If instead of the closure under countable unions required by (1.3), we only 
require that the collection of subsets be closed under finite unions, then we say 
that the collection of subsets is a field. 
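
On a finite sample space the distinction between a field and a σ-field disappears, since only finitely many distinct unions are possible, so the defining properties can be checked by brute force. The following is a minimal sketch under that assumption; the function name is_field and the three example collections are illustrative choices and do not come from the text.

```python
from itertools import combinations


def is_field(omega, collection):
    """Check the defining properties of a field on a finite sample space.

    On a finite space, closure under finite unions coincides with closure
    under countable unions, so a field here is also a sigma-field.
    """
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    # The whole space must be an event.
    if omega not in sets:
        return False
    # Closure under complementation.
    if any(omega - s not in sets for s in sets):
        return False
    # Closure under pairwise unions; all finite unions then follow by induction.
    return all(a | b in sets for a, b in combinations(sets, 2))


omega = {1, 2, 3, 4}
trivial = [set(), omega]                     # the trivial space
generated = [set(), {1, 2}, {3, 4}, omega]   # generated by the event {1, 2}
not_closed = [set(), {1}, {2}, omega]        # complement of {1} is missing

print(is_field(omega, trivial))
print(is_field(omega, generated))
print(is_field(omega, not_closed))
```

The third collection fails because it is not closed under complementation.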

While the concept of a field is simpler to work with, a $\sigma$-field possesses the 
additional important property that it contains all of the limits of sequences of 
sets in the collection. That is, if $F_n$, $n = 1, 2, \cdots$ is an increasing sequence of 
sets in a $\sigma$-field, that is, if $F_{n-1} \subset F_n$ and if $F = \bigcup_{n=1}^{\infty} F_n$ (in which case we 
write $F_n \uparrow F$ or $\lim_{n \to \infty} F_n = F$), then also $F$ is contained in the $\sigma$-field. In 
a similar fashion we can define decreasing sequences of sets: If $F_n$ decreases to 
$F$ in the sense that $F_{n+1} \subset F_n$ and $F = \bigcap_{n=1}^{\infty} F_n$, then we write $F_n \downarrow F$. If 
$F_n \in \mathcal{B}$ for all $n$, then $F \in \mathcal{B}$. 

A probability space $(\Omega, \mathcal{B}, P)$ is a triple consisting of a sample space $\Omega$, a $\sigma$-field 
$\mathcal{B}$ of subsets of $\Omega$, and a probability measure $P$ which assigns a real number 
$P(F)$ to every member $F$ of the $\sigma$-field $\mathcal{B}$ so that the following conditions are 
satisfied: 

• Nonnegativity: 

$$P(F) \ge 0, \text{ all } F \in \mathcal{B}; \tag{1.4}$$

• Normalization: 

$$P(\Omega) = 1; \tag{1.5}$$

• Countable Additivity: 

If $F_i \in \mathcal{B}$, $i = 1, 2, \cdots$ are disjoint, then 

$$P\Big( \bigcup_{i=1}^{\infty} F_i \Big) = \sum_{i=1}^{\infty} P(F_i). \tag{1.6}$$

A set function $P$ satisfying only (1.4) and (1.6) but not necessarily (1.5) 
is called a measure and the triple $(\Omega, \mathcal{B}, P)$ is called a measure space. Since the 
probability measure is defined on a $\sigma$-field, such countable unions of subsets of 
$\Omega$ in the $\sigma$-field are also events in the $\sigma$-field. 
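
For a finite sample space with the power set as event space, the three conditions on a probability measure can be verified exhaustively. The sketch below assumes a hypothetical three-point space with point masses 1/2, 1/3, and 1/6 and uses exact rational arithmetic; the helper names powerset and make_measure are illustrative only.

```python
from fractions import Fraction
from itertools import chain, combinations


def powerset(omega):
    """All subsets of a finite sample space, as frozensets."""
    items = list(omega)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]


def make_measure(pmf):
    """Return the set function P(F) = sum of the point masses in F."""
    return lambda F: sum((pmf[w] for w in F), Fraction(0))


# Hypothetical point masses on a three-point sample space.
pmf = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
P = make_measure(pmf)
events = powerset(pmf)

nonnegative = all(P(F) >= 0 for F in events)
normalized = P(frozenset(pmf)) == 1
# On a finite space countable additivity reduces to additivity over
# pairs of disjoint events.
additive = all(P(F | G) == P(F) + P(G)
               for F in events for G in events if not (F & G))

print(nonnegative, normalized, additive)
```

Dropping the normalization check while keeping the other two gives the corresponding test for a (finite) measure.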

A standard result of basic probability theory is that if $G_n \downarrow \emptyset$ (the empty or 
null set), that is, if $G_{n+1} \subset G_n$ for all $n$ and $\bigcap_{n=1}^{\infty} G_n = \emptyset$, then we have 

• Continuity at $\emptyset$: 

$$\lim_{n \to \infty} P(G_n) = 0. \tag{1.7}$$

Similarly it follows that we have 

• Continuity from Below: 

$$\text{If } F_n \uparrow F, \text{ then } \lim_{n \to \infty} P(F_n) = P(F), \tag{1.8}$$

and 

• Continuity from Above: 

$$\text{If } F_n \downarrow F, \text{ then } \lim_{n \to \infty} P(F_n) = P(F). \tag{1.9}$$

Given a measurable space $(\Omega, \mathcal{B})$, a collection $\mathcal{G}$ of members of $\mathcal{B}$ is said to 
generate $\mathcal{B}$ and we write $\sigma(\mathcal{G}) = \mathcal{B}$ if $\mathcal{B}$ is the smallest $\sigma$-field that contains $\mathcal{G}$; 
that is, if a $\sigma$-field contains all of the members of $\mathcal{G}$, then it must also contain all 
of the members of $\mathcal{B}$. The following is a fundamental approximation theorem of 
probability theory. A proof may be found in Corollary 1.5.3 of [50]. The result 
is most easily stated in terms of the symmetric difference $\Delta$ defined by 

$$F \Delta G = (F \cap G^c) \cup (F^c \cap G).$$

Theorem 1.2.1: Given a probability space $(\Omega, \mathcal{B}, P)$ and a generating field 
$\mathcal{F}$, that is, $\mathcal{F}$ is a field and $\mathcal{B} = \sigma(\mathcal{F})$, then given $F \in \mathcal{B}$ and $\epsilon > 0$, there exists 
an $F_0 \in \mathcal{F}$ such that $P(F \Delta F_0) < \epsilon$. 
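
The symmetric difference collects the points on which two events disagree, so the probability of the symmetric difference measures how well one event approximates the other. As a small illustration (the ten-point space, the two events, and the uniform measure are assumptions made only for this sketch), the set identity can be checked against Python's built-in symmetric difference operator:

```python
from fractions import Fraction


def sym_diff(F, G, omega):
    """Points in exactly one of F and G; complements are taken within omega."""
    return (F & (omega - G)) | ((omega - F) & G)


omega = set(range(10))
F = {0, 1, 2, 3}
G = {2, 3, 4}

D = sym_diff(F, G, omega)
# Probability of the symmetric difference under a uniform measure on omega.
P_D = Fraction(len(D), len(omega))

print(sorted(D), P_D)
```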

Let $(A, \mathcal{B}_A)$ denote another measurable space. A random variable or measurable 
function defined on $(\Omega, \mathcal{B})$ and taking values in $(A, \mathcal{B}_A)$ is a mapping or 
function $f : \Omega \to A$ with the property that 

$$\text{if } F \in \mathcal{B}_A, \text{ then } f^{-1}(F) = \{\omega : f(\omega) \in F\} \in \mathcal{B}. \tag{1.10}$$

The name “random variable” is commonly associated with the special case where 
$A$ is the real line and $\mathcal{B}_A$ the Borel field, the smallest $\sigma$-field containing all the 
intervals. Occasionally a more general sounding name such as “random object” 
is used for a measurable function to implicitly include random variables ($A$ the 
real line), random vectors ($A$ a Euclidean space), and random processes ($A$ a 
sequence or waveform space). We will use the term “random variable” in the 
more general sense. 

A random variable is just a function or mapping with the property that 
inverse images of “output events” determined by the random variable are events 
in the original measurable space. This simple property ensures that the output 
of the random variable will inherit its own probability measure. For example, 
with the probability measure $P_f$ defined by 

$$P_f(B) = P(f^{-1}(B)) = P(\omega : f(\omega) \in B); \quad B \in \mathcal{B}_A,$$

$(A, \mathcal{B}_A, P_f)$ becomes a probability space since measurability of $f$ and elementary 
set theory ensure that $P_f$ is indeed a probability measure. The induced 
probability measure $P_f$ is called the distribution of the random variable $f$. The 
measurable space $(A, \mathcal{B}_A)$ or, simply, the sample space $A$, is called the alphabet 
of the random variable $f$. We shall occasionally also use the notation $Pf^{-1}$ 
which is a mnemonic for the relation $Pf^{-1}(F) = P(f^{-1}(F))$ and which is less 
awkward when $f$ itself is a function with a complicated name, e.g., $\Pi_{I \to M}$. 
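
In the discrete case the induced distribution is obtained by adding up the probabilities of all sample points that the random variable maps to the same output value. The following is a minimal sketch, assuming a hypothetical experiment of two fair dice with the random variable taken to be their sum; the function name distribution is illustrative.

```python
from collections import defaultdict
from fractions import Fraction


def distribution(P, f):
    """Induced distribution of f: mass at b is P({w : f(w) = b}).

    P maps sample points of a finite space to their probabilities.
    """
    Pf = defaultdict(Fraction)
    for w, p in P.items():
        Pf[f(w)] += p
    return dict(Pf)


# Hypothetical underlying space: ordered pairs from two fair dice.
P = {(i, j): Fraction(1, 36) for i in range(1, 7) for j in range(1, 7)}


def f(w):
    """The random variable: the sum of the two faces."""
    return w[0] + w[1]


Pf = distribution(P, f)
print(Pf[7], sum(Pf.values()))
```

Here the alphabet of the random variable is the set of integers 2 through 12, and the induced set function is again a probability measure because its masses are nonnegative and sum to one.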

If the alphabet $A$ of a random variable $f$ is not clear from context, then we 
shall refer to $f$ as an $A$-valued random variable. If $f$ is a measurable function 
from $(\Omega, \mathcal{B})$ to $(A, \mathcal{B}_A)$, we will say that $f$ is $\mathcal{B}/\mathcal{B}_A$-measurable if the $\sigma$-fields 
might not be clear from context. 

Given a probability space $(\Omega, \mathcal{B}, P)$, a collection of subsets $\mathcal{G}$ is a sub-$\sigma$-field 
if it is a $\sigma$-field and all its members are in $\mathcal{B}$. A random variable $f : \Omega \to A$ 
is said to be measurable with respect to a sub-$\sigma$-field $\mathcal{G}$ if $f^{-1}(H) \in \mathcal{G}$ for all 
$H \in \mathcal{B}_A$. 

Given a probability space $(\Omega, \mathcal{B}, P)$ and a sub-$\sigma$-field $\mathcal{G}$, for any event $H \in \mathcal{B}$ 
the conditional probability $P(H|\mathcal{G})$ is defined as any function, say $g$, which 
satisfies the two properties 

$$g \text{ is measurable with respect to } \mathcal{G} \tag{1.11}$$

$$\int_G g \, dP = P(G \cap H); \text{ all } G \in \mathcal{G}. \tag{1.12}$$

An important special case of conditional probability occurs when studying 
the distributions of random variables defined on an underlying probability space. 
Suppose that $X : \Omega \to A_X$ and $Y : \Omega \to A_Y$ are two random variables defined 
on $(\Omega, \mathcal{B}, P)$ with alphabets $A_X$ and $A_Y$ and $\sigma$-fields $\mathcal{B}_{A_X}$ and $\mathcal{B}_{A_Y}$, respectively. 
Let $P_{XY}$ denote the induced distribution on $(A_X \times A_Y, \mathcal{B}_{A_X} \times \mathcal{B}_{A_Y})$, that is, 
$P_{XY}(F \times G) = P(X \in F, Y \in G) = P(X^{-1}(F) \cap Y^{-1}(G))$. Let $\sigma(Y)$ denote 
the sub-$\sigma$-field of $\mathcal{B}$ generated by $Y$, that is, $Y^{-1}(\mathcal{B}_{A_Y})$. Since the conditional 
probability $P(F|\sigma(Y))$ is real-valued and measurable with respect to $\sigma(Y)$, it 
can be written as $g(Y(\omega))$, $\omega \in \Omega$, for some function $g(y)$. (See, for example, 
Lemma 5.2.1 of [50].) Define $P(F|y) = g(y)$. For a fixed $F \in \mathcal{B}_{A_X}$ define the 
conditional distribution of $F$ given $Y = y$ by 

$$P_{X|Y}(F|y) = P(X^{-1}(F)|y); \quad y \in A_Y.$$

From the properties of conditional probability, 

$$P_{XY}(F \times G) = \int_G P_{X|Y}(F|y) \, dP_Y(y); \quad F \in \mathcal{B}_{A_X}, G \in \mathcal{B}_{A_Y}. \tag{1.13}$$

It is tempting to think that for a fixed $y$, the set function defined by 
$P_{X|Y}(F|y)$; $F \in \mathcal{B}_{A_X}$ is actually a probability measure. This is not the case in 
general. When it does hold for a conditional probability measure, the conditional 
probability measure is said to be regular. As will be emphasized later, this 
text will focus on standard alphabets for which regular conditional probabilities 
always exist. 
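
For finite alphabets the integral in (1.13) becomes a finite sum over the points of G, and the conditional distribution is simply the joint mass divided by the marginal mass of Y. The sketch below checks the relation exactly on a hypothetical joint probability mass function; the numerical values and the function names are assumptions made only for illustration. In this discrete setting the conditional distribution is automatically regular wherever the marginal of Y is positive.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y) on {0, 1} x {0, 1, 2}.
P_XY = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 8), (0, 2): Fraction(1, 8),
    (1, 0): Fraction(0),    (1, 1): Fraction(1, 4), (1, 2): Fraction(1, 4),
}


def marginal_y(joint):
    """Marginal pmf of Y obtained by summing the joint pmf over x."""
    out = {}
    for (_, y), p in joint.items():
        out[y] = out.get(y, Fraction(0)) + p
    return out


def conditional(joint, F, y):
    """Conditional probability of {X in F} given Y = y; needs P_Y(y) > 0."""
    p_y = marginal_y(joint)[y]
    mass = sum((p for (x, yy), p in joint.items() if x in F and yy == y),
               Fraction(0))
    return mass / p_y


def joint_from_conditional(joint, F, G):
    """Discrete form of the integral: sum over y in G of P(F|y) P_Y(y)."""
    p_y = marginal_y(joint)
    return sum((conditional(joint, F, y) * p_y[y]
                for y in G if p_y.get(y, 0) > 0), Fraction(0))


def joint_direct(joint, F, G):
    """Probability of the rectangle F x G computed from the joint pmf."""
    return sum((p for (x, y), p in joint.items() if x in F and y in G),
               Fraction(0))


F, G = {1}, {1, 2}
lhs = joint_direct(P_XY, F, G)
rhs = joint_from_conditional(P_XY, F, G)
print(lhs, rhs)
```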



1.3 Random Processes and Dynamical Systems 

We now consider two mathematical models for a source: a random process 
and a dynamical system. The first is the familiar one in elementary courses: a 
source is just a random process or sequence of random variables. The second 
model is possibly less familiar; a random process can also be constructed from 
an abstract dynamical system consisting of a probability space together with a 
transformation on the space. The two models are connected by considering a 
time shift to be a transformation. 

A discrete time random process or for our purposes simply a random process 
is a sequence of random variables $\{X_n\}_{n \in \mathcal{T}}$ or $\{X_n; n \in \mathcal{T}\}$, where $\mathcal{T}$ is an 
index set, defined on a common probability space $(\Omega, \mathcal{B}, P)$. We define a source 
as a random process, although we could also use the alternative definition of 
a dynamical system to be introduced shortly. We usually assume that all of 
the random variables share a common alphabet, say $A$. The two most common 
index sets of interest are the set of all integers $\mathcal{Z} = \{\cdots, -2, -1, 0, 1, 2, \cdots\}$, 
in which case the random process is referred to as a two-sided random process, 
and the set of all nonnegative integers $\mathcal{Z}_+ = \{0, 1, 2, \cdots\}$, in which case the 
random process is said to be one-sided. One-sided random processes will often 
prove to be far more difficult in theory, but they provide better models for 
physical random processes that must be “turned on” at some time or which 
have transient behavior. 

Observe that since the alphabet $A$ is general, we could also model continuous 
time random processes in the above fashion by letting $A$ consist of a family of 
waveforms defined on an interval, e.g., the random variable $X_n$ could in fact be 
a continuous time waveform $X(t)$ for $t \in [nT, (n+1)T)$, where $T$ is some fixed 
positive real number. 

The above definition does not specify any structural properties of the index 
set $\mathcal{T}$. In particular, it does not exclude the possibility that $\mathcal{T}$ be a finite set, in 
which case “random vector” would be a better name than “random process.” In 
fact, the two cases of $\mathcal{T} = \mathcal{Z}$ and $\mathcal{T} = \mathcal{Z}_+$ will be the only important examples 
for our purposes. Nonetheless, the general notation of $\mathcal{T}$ will be retained in 
order to avoid having to state separate results for these two cases. 

An abstract dynamical system consists of a probability space $(\Omega, \mathcal{B}, P)$ together 
with a measurable transformation $T : \Omega \to \Omega$ of $\Omega$ into itself. Measurability 
means that if $F \in \mathcal{B}$, then also $T^{-1}F = \{\omega : T\omega \in F\} \in \mathcal{B}$. The quadruple 
$(\Omega, \mathcal{B}, P, T)$ is called a dynamical system in ergodic theory. The interested reader 
can find excellent introductions to classical ergodic theory and dynamical system 
theory in the books of Halmos [62] and Sinai [136]. More complete treatments 
may be found in [15], [131], [124], [30], [147], [116], [42]. The term “dynamical 
systems” comes from the focus of the theory on the long term “dynamics” or 
“dynamical behavior” of repeated applications of the transformation T on the 
underlying measure space. 

An alternative to modeling a random process as a sequence or family of 
random variables defined on a common probability space is to consider a single 
random variable together with a transformation defined on the underlying 
probability space. The outputs of the random process will then be values of the 
random variable taken on transformed points in the original space. The transformation 
will usually be related to shifting in time and hence this viewpoint will 
focus on the action of time itself. Suppose now that $T$ is a measurable mapping 
of points of the sample space $\Omega$ into itself. It is easy to see that the cascade or 
composition of measurable functions is also measurable. Hence the transformation 
$T^n$ defined as $T^2\omega = T(T\omega)$ and so on ($T^n\omega = T(T^{n-1}\omega)$) is a measurable 
function for all positive integers $n$. If $f$ is an $A$-valued random variable defined 
on $(\Omega, \mathcal{B})$, then the functions $fT^n : \Omega \to A$ defined by $fT^n(\omega) = f(T^n\omega)$ for 
$\omega \in \Omega$ will also be random variables for all $n$ in $\mathcal{Z}_+$. Thus a dynamical system 
together with a random variable or measurable function $f$ defines a one-sided 
random process $\{X_n\}_{n \in \mathcal{Z}_+}$ by $X_n(\omega) = f(T^n\omega)$. If it should be true that $T$ is 
invertible, that is, $T$ is one-to-one and its inverse $T^{-1}$ is measurable, then one 
can define a two-sided random process by $X_n(\omega) = f(T^n\omega)$, all $n$ in $\mathcal{Z}$. 

The most common dynamical system for modeling random processes is that 
consisting of a sequence space $\Omega$ containing all one- or two-sided $A$-valued sequences 
together with the shift transformation $T$, that is, the transformation 
that maps a sequence $\{x_n\}$ into the sequence $\{x_{n+1}\}$ wherein each coordinate 
has been shifted to the left by one time unit. Thus, for example, let $\Omega = A^{\mathcal{Z}_+}$ 
$= \{\text{all } x = (x_0, x_1, \cdots) \text{ with } x_i \in A \text{ for all } i\}$ and define $T : \Omega \to \Omega$ by 
$T(x_0, x_1, x_2, \cdots) = (x_1, x_2, x_3, \cdots)$. $T$ is called the shift or left shift transformation 
on the one-sided sequence space. The shift for two-sided spaces is defined 
similarly. 
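
The construction of a one-sided process from a transformation and a single random variable can be imitated directly in code. In the sketch below a one-sided sequence is represented as a Python function from the nonnegative integers to the alphabet, which is an implementation choice and not something prescribed by the text; the names shift, power, and zero_sample and the example sequence of squares are likewise assumptions.

```python
def shift(x):
    """Left shift: the n-th coordinate of the result is x at n + 1."""
    return lambda n: x(n + 1)


def power(T, n):
    """The n-fold composition of T with itself (the identity for n = 0)."""
    def Tn(x):
        for _ in range(n):
            x = T(x)
        return x
    return Tn


def zero_sample(x):
    """The random variable f: read off the time-zero coordinate."""
    return x(0)


def x(n):
    """An example one-sided integer sequence 0, 1, 4, 9, ..."""
    return n * n


def X(n):
    """Process output at time n: f applied to the n-times shifted sequence."""
    return zero_sample(power(shift, n)(x))


print([X(n) for n in range(5)])
```

Sampling the shifted sequence at time zero returns the coordinates of the original sequence one at a time, which is the sense in which the shift models the passage of time.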

The different models provide equivalent models for a given process: one 
emphasizing the sequence of outputs and the other emphasizing the action of a 
transformation on the underlying space in producing these outputs. In order to 
demonstrate in what sense the models are equivalent for given random processes, 
we next turn to the notion of the distribution of a random process. 



1.4 Distributions 

While in principle all probabilistic quantities associated with a random process 
can be determined from the underlying probability space, it is often more con- 
venient to deal with the induced probability measures or distributions on the 
space of possible outputs of the random process. In particular, this allows us to 
compare different random processes without regard to the underlying probabil- 
ity spaces and thereby permits us to reasonably equate two random processes 
if their outputs have the same probabilistic structure, even if the underlying 
probability spaces are quite different. 

We have already seen that each random variable $X_n$ of the random process 
$\{X_n\}$ inherits a distribution because it is measurable. To describe a process, 
however, we need more than simply probability measures on output values of 
separate single random variables; we require probability measures on collections 
of random variables, that is, on sequences of outputs. In order to place probability 
measures on sequences of outputs of a random process, we first must 
construct the appropriate measurable spaces. A convenient technique for accomplishing 
this is to consider product spaces, spaces for sequences formed by 
concatenating spaces for individual outputs. 

Let $\mathcal{T}$ denote any finite or infinite set of integers. In particular, 
$\mathcal{T} = \mathcal{Z}(n) = \{0, 1, 2, \cdots, n-1\}$, $\mathcal{T} = \mathcal{Z}$, or $\mathcal{T} = \mathcal{Z}_+$. Define $x^{\mathcal{T}} = \{x_i\}_{i \in \mathcal{T}}$. For example, 
$x^{\mathcal{Z}} = (\cdots, x_{-1}, x_0, x_1, \cdots)$ is a two-sided infinite sequence. When $\mathcal{T} = \mathcal{Z}(n)$ we 
abbreviate $x^{\mathcal{Z}(n)}$ to simply $x^n$. Given alphabets $A_i$, $i \in \mathcal{T}$, define the cartesian 
product space 

$$\mathop{\times}\limits_{i \in \mathcal{T}} A_i = \{\text{all } x^{\mathcal{T}} : x_i \in A_i \text{ all } i \text{ in } \mathcal{T}\}.$$

In most cases all of the $A_i$ will be replicas of a single alphabet $A$ and the above 
product will be denoted simply by $A^{\mathcal{T}}$. Thus, for example, $A^{\{m, m+1, \cdots, n\}}$ is 
the space of all possible outputs of the process from time $m$ to time $n$; $A^{\mathcal{Z}}$ 
is the sequence space of all possible outputs of a two-sided process. We shall 
abbreviate the notation for the space $A^{\mathcal{Z}(n)}$, the space of all $n$ dimensional 
vectors with coordinates in $A$, by $A^n$. 

To obtain useful $\sigma$-fields of the above product spaces, we introduce the idea of 
a rectangle in a product space. A rectangle in $A^{\mathcal{T}}$ taking values in the coordinate 
$\sigma$-fields $\mathcal{B}_i$, $i \in \mathcal{J}$, is defined as any set of the form 

$$B = \{x^{\mathcal{T}} \in A^{\mathcal{T}} : x_i \in B_i; \text{ all } i \text{ in } \mathcal{J}\}, \tag{1.14}$$

where $\mathcal{J}$ is a finite subset of the index set $\mathcal{T}$ and $B_i \in \mathcal{B}_i$ for all $i \in \mathcal{J}$. 
(Hence rectangles are sometimes referred to as finite dimensional rectangles.) A 
rectangle as in (1.14) can be written as a finite intersection of one-dimensional 
rectangles as 

$$B = \bigcap_{i \in \mathcal{J}} \{x^{\mathcal{T}} \in A^{\mathcal{T}} : x_i \in B_i\} = \bigcap_{i \in \mathcal{J}} X_i^{-1}(B_i) \tag{1.15}$$

where here we consider $X_i$ as the coordinate functions $X_i : A^{\mathcal{T}} \to A$ defined by 
$X_i(x^{\mathcal{T}}) = x_i$. 

As rectangles in $A^{\mathcal{T}}$ are clearly fundamental events, they should be members 
of any useful $\sigma$-field of subsets of $A^{\mathcal{T}}$. Define the product $\sigma$-field $\mathcal{B}_{A^{\mathcal{T}}}$ as the 
smallest $\sigma$-field containing all of the rectangles, that is, the collection of sets that 
contains the clearly important class of rectangles and the minimum amount of 
other stuff required to make the collection a $\sigma$-field. To be more precise, given 
an index set $\mathcal{T}$ of integers, let $RECT(\mathcal{B}_i, i \in \mathcal{T})$ denote the set of all rectangles 
in $A^{\mathcal{T}}$ taking coordinate values in sets in $\mathcal{B}_i$, $i \in \mathcal{T}$. We then define the product 
$\sigma$-field of $A^{\mathcal{T}}$ by 

$$\mathcal{B}_{A^{\mathcal{T}}} = \sigma(RECT(\mathcal{B}_i, i \in \mathcal{T})). \tag{1.16}$$

Consider an index set $\mathcal{T}$ and an $A$-valued random process $\{X_n\}_{n \in \mathcal{T}}$ defined 
on an underlying probability space $(\Omega, \mathcal{B}, P)$. Given any index set $\mathcal{J} \subset \mathcal{T}$, 
measurability of the individual random variables $X_n$ implies that of the random 
vectors $X^{\mathcal{J}} = \{X_n; n \in \mathcal{J}\}$. Thus the measurable space $(A^{\mathcal{J}}, \mathcal{B}_{A^{\mathcal{J}}})$ inherits a 
probability measure from the underlying space through the random variables 
$X^{\mathcal{J}}$. Thus in particular the measurable space $(A^{\mathcal{T}}, \mathcal{B}_{A^{\mathcal{T}}})$ inherits a probability 
measure from the underlying probability space and thereby determines a new 
probability space $(A^{\mathcal{T}}, \mathcal{B}_{A^{\mathcal{T}}}, P_{X^{\mathcal{T}}})$, where the induced probability measure is 
defined by 

$$P_{X^{\mathcal{T}}}(F) = P((X^{\mathcal{T}})^{-1}(F)) = P(\omega : X^{\mathcal{T}}(\omega) \in F); \quad F \in \mathcal{B}_{A^{\mathcal{T}}}. \tag{1.17}$$

Such probability measures induced on the outputs of random variables are referred 
to as distributions for the random variables, exactly as in the simpler case 
first treated. When $\mathcal{T} = \{m, m+1, \cdots, m+n-1\}$, e.g., when we are treating 
$X_m^n = (X_m, \cdots, X_{m+n-1})$ taking values in $A^n$, the distribution is referred to 
as an $n$-dimensional or $n$th order distribution and it describes the behavior of 
an $n$-dimensional random variable. If $\mathcal{T}$ is the entire process index set, e.g., if 
$\mathcal{T} = \mathcal{Z}$ for a two-sided process or $\mathcal{T} = \mathcal{Z}_+$ for a one-sided process, then the 
induced probability measure is defined to be the distribution of the process. 
Thus, for example, a probability space $(\Omega, \mathcal{B}, P)$ together with a doubly infinite 
sequence of random variables $\{X_n\}_{n \in \mathcal{Z}}$ induces a new probability space 
$(A^{\mathcal{Z}}, \mathcal{B}_{A^{\mathcal{Z}}}, P_{X^{\mathcal{Z}}})$ and $P_{X^{\mathcal{Z}}}$ is the distribution of the process. For simplicity, let 
us now denote the process distribution simply by $m$. We shall call the probability 
space $(A^{\mathcal{T}}, \mathcal{B}_{A^{\mathcal{T}}}, m)$ induced in this way by a random process $\{X_n\}_{n \in \mathcal{T}}$ 
the output space or sequence space of the random process. 

Since the sequence space $(A^{\mathcal{T}}, \mathcal{B}_{A^{\mathcal{T}}}, m)$ of a random process $\{X_n\}_{n \in \mathcal{T}}$ is a 
probability space, we can define random variables and hence also random processes 
on this space. One simple and useful such definition is that of a sampling 
or coordinate or projection function defined as follows: Given a product space 
$A^{\mathcal{T}}$, define the sampling functions $\Pi_n : A^{\mathcal{T}} \to A$ by 

$$\Pi_n(x^{\mathcal{T}}) = x_n, \quad x^{\mathcal{T}} \in A^{\mathcal{T}}; \ n \in \mathcal{T}. \tag{1.18}$$

The sampling function is named $\Pi$ since it is also a projection. Observe that the 
distribution of the random process $\{\Pi_n\}_{n \in \mathcal{T}}$ defined on the probability space 
$(A^{\mathcal{T}}, \mathcal{B}_{A^{\mathcal{T}}}, m)$ is exactly the same as the distribution of the random process 
$\{X_n\}_{n \in \mathcal{T}}$ defined on the probability space $(\Omega, \mathcal{B}, P)$. In fact, so far they are the 
same process since the $\{\Pi_n\}$ simply read off the values of the $\{X_n\}$. 
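
The agreement between the two distributions can be checked exactly in a small finite case. The sketch below takes the index set to be {0, 1}, so the "process" is really a random vector, and assumes a hypothetical three-point underlying space on which two of the sample points produce the same output sequence; all names and numerical values are illustrative.

```python
from collections import defaultdict
from fractions import Fraction


def pushforward(P, g):
    """Distribution induced by g from a finite probability mass function P."""
    out = defaultdict(Fraction)
    for w, p in P.items():
        out[g(w)] += p
    return dict(out)


# Hypothetical underlying probability space with three sample points.
P = {"w1": Fraction(1, 2), "w2": Fraction(1, 4), "w3": Fraction(1, 4)}

# Binary random variables X_0, X_1 given as lookup tables on the space.
X0 = {"w1": 0, "w2": 0, "w3": 1}
X1 = {"w1": 1, "w2": 1, "w3": 0}

# Process distribution m on the sequence space of binary pairs.
m = pushforward(P, lambda w: (X0[w], X1[w]))

# Coordinate (sampling) functions defined directly on the sequence space.
Pi = [lambda x, n=n: x[n] for n in range(2)]

# Distribution of the directly given process under m.
m_direct = pushforward(m, lambda x: tuple(Pi_n(x) for Pi_n in Pi))

print(m)
print(m_direct == m)
```

The underlying space has three points while the sequence space carries mass on only two sequences, yet the two processes have the same distribution.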

What happens, however, if we no longer build the $\Pi_n$ on the $X_n$, that is, we 
no longer first select $\omega$ from $\Omega$ according to $P$, then form the sequence 
$x^{\mathcal{T}} = X^{\mathcal{T}}(\omega) = \{X_n(\omega)\}_{n \in \mathcal{T}}$, and then define $\Pi_n(x^{\mathcal{T}}) = X_n(\omega)$? Instead we directly 
choose an $x$ in $A^{\mathcal{T}}$ using the probability measure $m$ and then view the sequence 
of coordinate values. In other words, we are considering two completely separate 
experiments, one described by the probability space $(\Omega, \mathcal{B}, P)$ and the random 
variables $\{X_n\}$ and the other described by the probability space $(A^{\mathcal{T}}, \mathcal{B}_{A^{\mathcal{T}}}, m)$ 
and the random variables $\{\Pi_n\}$. In these two separate experiments, the actual 
sequences selected may be completely different. Yet intuitively the processes 
should be the “same” in the sense that their statistical structures are identical, 
that is, they have the same distribution. We make this intuition formal by 
defining two processes to be equivalent if their process distributions are identical, 
that is, if the probability measures on the output sequence spaces are the same, 
regardless of the functional form of the random variables of the underlying 
probability spaces. In the same way, we consider two random variables to be 
equivalent if their distributions are identical. 

We have described above two equivalent processes or two equivalent models 
for the same random process, one defined as a sequence of random variables 
on a perhaps very complicated underlying probability space, the other defined 
as a probability measure directly on the measurable space of possible output 
sequences. The second model will be referred to as a directly given random 
process. 

Which model is “better” depends on the application. For example, a directly 
given model for a random process may focus on the random process itself and not 
its origin and hence may be simpler to deal with. If the random process is then 
coded or measurements are taken on the random process, then it may be better 
to model the encoded random process in terms of random variables defined on 
the original random process and not as a directly given random process. This 
model will then focus on the input process and the coding operation. We shall 
let convenience determine the most appropriate model. 

We can now describe yet another model for the above random process, that 
is, another means of describing a random process with the same distribution. 
This time the model is in terms of a dynamical system. Given the probability 
space $(A^{\mathcal{T}}, \mathcal{B}_{A^{\mathcal{T}}}, m)$, define the (left) shift transformation $T : A^{\mathcal{T}} \to A^{\mathcal{T}}$ by 

$$T(x^{\mathcal{T}}) = T(\{x_n\}_{n \in \mathcal{T}}) = y^{\mathcal{T}} = \{y_n\}_{n \in \mathcal{T}},$$

where 

$$y_n = x_{n+1}, \quad n \in \mathcal{T}.$$

Thus the $n$th coordinate of $y^{\mathcal{T}}$ is simply the $(n+1)$st coordinate of $x^{\mathcal{T}}$. (We 
assume that $\mathcal{T}$ is closed under addition and hence if $n$ and 1 are in $\mathcal{T}$, then so 
is $(n+1)$.) If the alphabet of such a shift is not clear from context, we will 
occasionally denote the shift by $T_A$ or $T_{A^{\mathcal{T}}}$. The shift can easily be shown to 
be measurable. 

Consider next the dynamical system $(A^{\mathcal{T}}, \mathcal{B}_{A^{\mathcal{T}}}, m, T)$ and the random process 
formed by combining the dynamical system with the zero time sampling 
function $\Pi_0$ (we assume that 0 is a member of $\mathcal{T}$). If we define $Y_n(x) = \Pi_0(T^n x)$ 
for $x = x^{\mathcal{T}} \in A^{\mathcal{T}}$, or, in abbreviated form, $Y_n = \Pi_0 T^n$, then the random process 
$\{Y_n\}_{n \in \mathcal{T}}$ is equivalent to the processes developed above. Thus we have 
developed three different, but equivalent, means of producing the same random 
process. Each will be seen to have its uses. 

The above development shows that a dynamical system is a more fundamen- 
tal entity than a random process since we can always construct an equivalent 
model for a random process in terms of a dynamical system: use the directly 
given representation, shift transformation, and zero time sampling function. 

The shift transformation on a sequence space introduced above is the most 
important transformation that we shall encounter. It is not, however, the only 
important transformation. When dealing with transformations we will usually 
use the notation T to reflect the fact that it is often related to the action of a 
simple left shift of a sequence, yet it should be kept in mind that occasionally 
other operators will be considered and the theory to be developed will remain 
valid, even if T is not required to be a simple time shift. For example, we will 
also consider block shifts. 

Most texts on ergodic theory deal with the case of an invertible transformation, 
that is, where $T$ is a one-to-one transformation and the inverse mapping 
$T^{-1}$ is measurable. This is the case for the shift on $A^{\mathcal{Z}}$, the two-sided shift. It is 
not the case, however, for the one-sided shift defined on $A^{\mathcal{Z}_+}$ and hence we will 
avoid use of this assumption. We will, however, often point out in the discussion 
what simplifications or special properties arise for invertible transformations. 

Since random processes are considered equivalent if their distributions are 
the same, we shall adopt the notation $[A, m, X]$ for a random process $\{X_n; n \in \mathcal{T}\}$ 
with alphabet $A$ and process distribution $m$, the index set $\mathcal{T}$ usually being 
clear from context. We will occasionally abbreviate this to the more common 
notation $[A, m]$, but it is often convenient to note the name of the output random 
variables as there may be several, e.g., a random process may have an 
input $X$ and output $Y$. By “the associated probability space” of a random 
process $[A, m, X]$ we shall mean the sequence probability space $(A^{\mathcal{T}}, \mathcal{B}_{A^{\mathcal{T}}}, m)$. 
It will often be convenient to consider the random process as a directly given 
random process, that is, to view $X_n$ as the coordinate functions $\Pi_n$ on the sequence 
space $A^{\mathcal{T}}$ rather than as being defined on some other abstract space. 
This will not always be the case, however, as often processes will be formed by 
coding or communicating other random processes. Context should render such 
bookkeeping details clear. 

1.5 Standard Alphabets 

A measurable space $(A, \mathcal{B}_A)$ is a standard space if there exists a sequence of 
finite fields $\mathcal{F}_n$; $n = 1, 2, \cdots$ with the following properties: 

(1) $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ (the fields are increasing). 

(2) $\mathcal{B}_A$ is the smallest $\sigma$-field containing all of the $\mathcal{F}_n$ (the $\mathcal{F}_n$ generate $\mathcal{B}_A$ or 
$\mathcal{B}_A = \sigma(\bigcup_{n=1}^{\infty} \mathcal{F}_n)$). 

(3) An event $G_n \in \mathcal{F}_n$ is called an atom of the field if it is nonempty and 
its only subsets which are also field members are itself and the empty set. 
If $G_n \in \mathcal{F}_n$; $n = 1, 2, \cdots$ are atoms and $G_{n+1} \subset G_n$ for all $n$, then 

$$\bigcap_{n=1}^{\infty} G_n \ne \emptyset.$$

Standard spaces are important for several reasons: First, they are a general class 
of spaces for which two of the key results of probability hold: (1) the Kolmogorov 
extension theorem showing that a random process is completely described by its 
finite order distributions, and (2) the existence of regular conditional probability 
measures. Thus, in particular, the conditional probability measure $P_{X|Y}(F|y)$ 
of (1.13) is regular if the alphabets $A_X$ and $A_Y$ are standard and hence for each 
fixed $y \in A_Y$ the set function $P_{X|Y}(F|y)$; $F \in \mathcal{B}_{A_X}$ is a probability measure. 
In this case we can interpret $P_{X|Y}(F|y)$ as $P(X \in F | Y = y)$. Second, the 
ergodic decomposition theorem of ergodic theory holds for such spaces. Third, 
the class is sufficiently general to include virtually all examples arising in applications, 
e.g., discrete spaces, the real line, Euclidean vector spaces, Polish 
spaces (complete separable metric spaces), etc. The reader is referred to [50] 
and the references cited therein for a detailed development of these properties 
and examples of standard spaces. 

Standard spaces are not the most general spaces for which the Kolmogorov
extension theorem, the existence of conditional probability, and the ergodic 
decomposition theorem hold. These results also hold for perfect spaces which 
include standard spaces as a special case. (See, e.g., [128], [139], [126], [98].) We 
limit discussion to standard spaces, however, as they are easier to characterize 
and work with and they are sufficiently general to handle most cases encountered 
in applications. Although standard spaces are not the most general for which the 
required probability theory results hold, they are the most general for which all 
finitely additive normalized measures extend to countably additive probability 
measures, a property which greatly eases the proof of many of the desired results. 

Throughout this book we shall assume that the alphabet A of the information 
source is a standard space. 



1.6 Expectation 

Let $(\Omega, \mathcal{B}, m)$ be a probability space, e.g., the probability space of a directly
given random process with alphabet $A$, $(A^{\mathcal{T}}, \mathcal{B}_A^{\mathcal{T}}, m)$. A real-valued random
variable $f : \Omega \to \mathcal{R}$ will also be called a measurement since it is often formed
by taking a mapping or function of some other set of more general random
variables, e.g., the outputs of some random process which might not have
real-valued outputs. Measurements made on such processes, however, will always be
assumed to be real.

Suppose next we have a measurement $f$ whose range space or alphabet
$f(\Omega) \subset \mathcal{R}$ of possible values is finite. Then $f$ is called a discrete random
variable or discrete measurement or digital measurement or, in the common
mathematical terminology, a simple function.

Given a discrete measurement $f$, suppose that its range space is $f(\Omega) =
\{b_i; i = 1, \ldots, M\}$, where the $b_i$ are distinct. Define the sets $F_i = f^{-1}(b_i) =
\{x : f(x) = b_i\}$, $i = 1, \ldots, M$. Since $f$ is measurable, the $F_i$ are all members
of $\mathcal{B}$. Since the $b_i$ are distinct, the $F_i$ are disjoint. Since every input point in
$\Omega$ must map into some $b_i$, the union of the $F_i$ equals $\Omega$. Thus the collection
$\{F_i; i = 1, 2, \ldots, M\}$ forms a partition of $\Omega$. We have therefore shown that any







discrete measurement $f$ can be expressed in the form

$$f(x) = \sum_{i=1}^{M} b_i 1_{F_i}(x),$$ (1.19)

where $b_i \in \mathcal{R}$, the $F_i \in \mathcal{B}$ form a partition of $\Omega$, and $1_{F_i}$ is the indicator function
of $F_i$, $i = 1, \ldots, M$. Every simple function has a unique representation in this
form with distinct $b_i$ and $\{F_i\}$ a partition.

The expectation or ensemble average or probabilistic average or mean of a
discrete measurement $f : \Omega \to \mathcal{R}$ as in (1.19) with respect to a probability
measure $m$ is defined by

$$E_m f = \sum_{i=1}^{M} b_i m(F_i).$$ (1.20)

An immediate consequence of the definition of expectation is the simple but
useful fact that for any event $F$ in the original probability space,

$$E_m 1_F = m(F),$$

that is, probabilities can be found from expectations of indicator functions.

Again let $(\Omega, \mathcal{B}, m)$ be a probability space and $f : \Omega \to \mathcal{R}$ a measurement,
that is, a real-valued random variable or measurable real-valued function. Define
the sequence of quantizers $q_n : \mathcal{R} \to \mathcal{R}$, $n = 1, 2, \ldots$ as follows:

$$q_n(r) = \begin{cases}
n, & n \le r \\
(k-1)2^{-n}, & (k-1)2^{-n} \le r < k2^{-n};\ k = 1, 2, \ldots, n2^n \\
-(k-1)2^{-n}, & -k2^{-n} \le r < -(k-1)2^{-n};\ k = 1, 2, \ldots, n2^n \\
-n, & r < -n.
\end{cases}$$

We now define expectation for general measurements in two steps. If $f \ge 0$,
then define

$$E_m f = \lim_{n \to \infty} E_m(q_n(f)).$$ (1.21)

Since the $q_n$ are discrete measurements on $f$, the $q_n(f)$ are discrete
measurements on $\Omega$ ($q_n(f)(x) = q_n(f(x))$ is a simple function) and hence the individual
expectations are well defined. Since the $q_n(f)$ are nondecreasing, so are the
$E_m(q_n(f))$ and this sequence must either converge to a finite limit or grow
without bound, in which case we say it converges to $\infty$. In both cases the
expectation $E_m f$ is well defined, although it may be infinite.
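
The first step of this construction can be illustrated numerically. The following Python sketch implements the quantizer $q_n$ and evaluates $E_m(q_n(f))$ for a nonnegative measurement on a three-point probability space. The function names, sample values, and probabilities are illustrative assumptions and do not come from the text.

```python
import math

def quantize(r, n):
    """Quantizer q_n: saturate at +/- n; otherwise return the endpoint
    (nearer zero) of the dyadic cell of width 2**-n that contains r."""
    step = 2.0 ** -n
    if r >= n:
        return float(n)
    if r < -n:
        return float(-n)
    if r >= 0:
        # (k-1) 2^-n <= r < k 2^-n  is mapped to  (k-1) 2^-n
        return math.floor(r / step) * step
    # -k 2^-n <= r < -(k-1) 2^-n  is mapped to  -(k-1) 2^-n
    k = math.ceil(-r / step)
    return -(k - 1) * step

def quantized_expectation(values, probs, n):
    """E_m(q_n(f)) when f takes values[i] with probability probs[i]."""
    return sum(quantize(v, n) * p for v, p in zip(values, probs))

# Hypothetical nonnegative measurement; its exact mean is 1.2.
values = [0.3, 1.7, 2.5]
probs = [0.5, 0.25, 0.25]
approx = [quantized_expectation(values, probs, n) for n in range(1, 21)]
```

For this example the list `approx` is nondecreasing in $n$ and approaches the exact mean 1.2 from below, as the argument following (1.21) predicts for nonnegative measurements.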

If $f$ is an arbitrary real random variable, define its positive and negative parts
$f^+(x) = \max(f(x), 0)$ and $f^-(x) = -\min(f(x), 0)$ so that $f(x) = f^+(x) - f^-(x)$
and set

$$E_m f = E_m f^+ - E_m f^-$$ (1.22)

provided this does not have the form $+\infty - \infty$, in which case the expectation
does not exist. It can be shown that the expectation can also be evaluated for
nonnegative measurements by the formula

$$E_m f = \sup_{\text{discrete } g:\ g \le f} E_m g.$$

The expectation is also called an integral and is denoted by any of the
following:

$$E_m f = \int f\,dm = \int f(x)\,dm(x) = \int f(x)\,m(dx).$$

The subscript m denoting the measure with respect to which the expectation is 
taken will occasionally be omitted if it is clear from context. 

A measurement $f$ is said to be integrable or $m$-integrable if $E_m f$ exists and
is finite. A function is integrable if and only if its absolute value is integrable.
Define $L^1(m)$ to be the space of all $m$-integrable functions. Given any
$m$-integrable $f$ and an event $B$, define

$$\int_B f\,dm = \int f(x) 1_B(x)\,dm(x).$$



Two random variables $f$ and $g$ are said to be equal $m$-almost-everywhere
or equal $m$-a.e. or equal with $m$-probability one if $m(f = g) = m(\{x : f(x) =
g(x)\}) = 1$. The $m$- is dropped if it is clear from context.

Given a probability space $(\Omega, \mathcal{B}, m)$, suppose that $\mathcal{G}$ is a sub-$\sigma$-field of $\mathcal{B}$,
that is, it is a $\sigma$-field of subsets of $\Omega$ and all those subsets are in $\mathcal{B}$ ($\mathcal{G} \subset \mathcal{B}$).
Let $f : \Omega \to \mathcal{R}$ be an integrable measurement. Then the conditional expectation
$E(f|\mathcal{G})$ is described as any function, say $h(\omega)$, that satisfies the following two
properties:

$h(\omega)$ is measurable with respect to $\mathcal{G}$ (1.23)

$$\int_G h\,dm = \int_G f\,dm; \quad \text{all } G \in \mathcal{G}.$$ (1.24)
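
On a finite probability space where the sub-$\sigma$-field is generated by a finite partition, a function satisfying (1.23) and (1.24) can be written down directly: it is the probability-weighted average of $f$ over each cell. The Python sketch below checks this on a small example. The fair-die space, the measurement, and all function names are illustrative assumptions.

```python
from fractions import Fraction

def conditional_expectation(f, m, partition):
    """Return h = E(f | G) as a dict, where G is generated by `partition`.

    f, m: dicts mapping sample points to values and to probabilities.
    partition: list of disjoint sets of sample points covering the space.
    Cells of probability zero are given the arbitrary value 0.
    """
    h = {}
    for cell in partition:
        mass = sum(m[w] for w in cell)
        avg = sum(f[w] * m[w] for w in cell) / mass if mass else 0
        for w in cell:
            h[w] = avg  # constant on each cell, hence G-measurable (1.23)
    return h

def integral(g, m, event):
    """Integral of g over `event` with respect to m."""
    return sum(g[w] * m[w] for w in event)

m = {w: Fraction(1, 6) for w in range(1, 7)}   # fair die (hypothetical)
f = {w: w * w for w in m}                      # measurement f(w) = w^2
partition = [{1, 3, 5}, {2, 4, 6}]             # G generated by odd / even
h = conditional_expectation(f, m, partition)
```

Here the events of $\mathcal{G}$ are the empty set, the odd outcomes, the even outcomes, and the whole space, and (1.24) can be verified by comparing `integral(h, m, G)` with `integral(f, m, G)` on each of them.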

If a regular conditional probability distribution given $\mathcal{G}$ exists, e.g., if the
space is standard, then one has a constructive definition of conditional
expectation: $E(f|\mathcal{G})(\omega)$ is simply the expectation of $f$ with respect to the conditional
probability measure $m(\cdot|\mathcal{G})(\omega)$. Applying this to the example of two random
variables $X$ and $Y$ with standard alphabets described in Section 1.2 we have
from (1.24) that for integrable $f : A_X \times A_Y \to \mathcal{R}$

$$E(f) = \int f(x,y)\,dP_{XY}(x,y) = \int \left( \int f(x,y)\,dP_{X|Y}(x|y) \right) dP_Y(y).$$ (1.25)

In particular, for fixed $y$, $f(x,y)$ is an integrable (and measurable) function of $x$.

Equation (1.25) provides a generalization of (1.13) from rectangles to
arbitrary events. For an arbitrary $F \in \mathcal{B}_{A_X \times A_Y}$ we have that

$$P_{XY}(F) = \int \left( \int 1_F(x,y)\,dP_{X|Y}(x|y) \right) dP_Y(y) = \int P_{X|Y}(F_y|y)\,dP_Y(y),$$ (1.26)

where $F_y = \{x : (x,y) \in F\}$ is called the section of $F$ at $y$. If $F$ is measurable,
then so is $F_y$ for all $y$.
The inner integral is just

$$\int_{x:(x,y) \in F} dP_{X|Y}(x|y) = P_{X|Y}(F_y|y),$$

where the set $F_y = \{x : (x,y) \in F\}$ is the section of $F$ at $y$. Since
$1_F(x,y)$ is measurable with respect to $x$ for each fixed $y$, $F_y \in \mathcal{B}_{A_X}$.

1.7 Asymptotic Mean Stationarity 

A dynamical system (or the associated source) $(\Omega, \mathcal{B}, P, T)$ is said to be
stationary if

$$P(T^{-1}G) = P(G)$$

for all $G \in \mathcal{B}$. It is said to be asymptotically mean stationary or, simply, AMS
if the limit

$$\bar{P}(G) = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} P(T^{-k}G)$$ (1.27)

exists for all $G \in \mathcal{B}$. The following theorems summarize several important
properties of AMS sources. Details may be found in Chapter 6 of [50].

Theorem 1.7.1: If a dynamical system $(\Omega, \mathcal{B}, P, T)$ is AMS, then $\bar{P}$ defined
in (1.27) is a probability measure and $(\Omega, \mathcal{B}, \bar{P}, T)$ is stationary. ($\bar{P}$ is called the
stationary mean of $P$.) If an event $G$ is invariant in the sense that $T^{-1}G = G$,
then

$$\bar{P}(G) = P(G).$$

If a random variable $g$ is invariant in the sense that $g(Tx) = g(x)$ with $P$
probability 1, then

$$E_P g = E_{\bar{P}} g.$$

The stationary mean $\bar{P}$ asymptotically dominates $P$ in the sense that if $\bar{P}(G)
= 0$, then

$$\limsup_{n \to \infty} P(T^{-n}G) = 0.$$

Theorem 1.7.2: Given an AMS source $\{X_n\}$ let $\sigma(X_n, X_{n+1}, \ldots)$ denote
the $\sigma$-field generated by the random variables $X_n, \ldots$, that is, the smallest
$\sigma$-field with respect to which all these random variables are measurable. Define
the tail $\sigma$-field $\mathcal{F}_\infty$ by

$$\mathcal{F}_\infty = \bigcap_{n=0}^{\infty} \sigma(X_n, X_{n+1}, \ldots).$$

If $G \in \mathcal{F}_\infty$ and $\bar{P}(G) = 0$, then also $P(G) = 0$.

The tail $\sigma$-field can be thought of as events that are determinable by looking
only at samples of the sequence in the arbitrarily distant future. The theorem 
states that the stationary mean dominates the original measure on such tail 
events in the sense that zero probability under the stationary mean implies zero 
probability under the original source. 
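
As a small illustration of the limit (1.27), the Python sketch below evaluates the Cesaro averages of $P(X_k = a)$, that is, (1.27) restricted to the events $G = \{x : x_0 = a\}$, for a two-state Markov chain. The chain, the starting distribution, and the function names are illustrative assumptions, and the sketch checks the limit only on these one-dimensional events rather than on all of $\mathcal{B}$.

```python
def step(dist, trans):
    """One step of a Markov chain: row vector `dist` times matrix `trans`."""
    k = len(dist)
    return [sum(dist[i] * trans[i][j] for i in range(k)) for j in range(k)]

def cesaro_mean(dist0, trans, n):
    """(1/n) * sum_{k=0}^{n-1} P(X_k = a) for each state a."""
    total = [0.0] * len(dist0)
    dist = list(dist0)
    for _ in range(n):
        total = [t + d for t, d in zip(total, dist)]
        dist = step(dist, trans)
    return [t / n for t in total]

# Hypothetical example: deterministic alternation 0 -> 1 -> 0 -> ...
flip = [[0.0, 1.0], [1.0, 0.0]]
start = [1.0, 0.0]                 # the chain always starts in state 0
stationary_mean = cesaro_mean(start, flip, 1000)
```

In this example $P(X_k = 0)$ alternates between 1 and 0 and so has no limit, yet the averages settle at 1/2 for each state. The source is therefore not stationary, but the averaged probabilities of these events converge as (1.27) requires.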
1.8 Ergodic Properties

Two of the basic results of ergodic theory that will be called upon extensively
are the pointwise or almost-everywhere ergodic theorem and the ergodic
decomposition theorem. We quote these results along with some relevant notation for
reference. Detailed developments may be found in Chapters 6-8 of [50]. The
ergodic theorem states that AMS dynamical systems (and hence also sources)
have convergent sample averages, and it characterizes the limits.

Theorem 1.8.1: If a dynamical system $(\Omega, \mathcal{B}, m, T)$ is AMS with stationary
mean $\bar{m}$ and if $f \in L^1(\bar{m})$, then with probability one under $m$ and $\bar{m}$

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} fT^i = E_{\bar{m}}(f|\mathcal{I}),$$

where $\mathcal{I}$ is the sub-$\sigma$-field of invariant events, that is, events $G$ for which $T^{-1}G =
G$.

The basic idea of the ergodic decomposition is that any stationary source 
which is not ergodic can be represented as a mixture of stationary ergodic com- 
ponents or subsources. 

Theorem 1.8.2: Given the standard sequence space $(\Omega, \mathcal{B})$ with shift $T$ as
previously, there exists a family of stationary ergodic measures $\{p_x; x \in \Omega\}$,
called the ergodic decomposition, with the following properties:

(a) $p_{Tx} = p_x$.

(b) For any stationary measure $m$,

$$m(G) = \int p_x(G)\,dm(x); \quad \text{all } G \in \mathcal{B}.$$

(c) For any $g \in L^1(m)$

$$\int g\,dm = \int \left( \int g\,dp_x \right) dm(x).$$



It is important to note that the same collection of stationary ergodic components
works for any stationary measure $m$. This is the strong form of the ergodic
decomposition.

The final result of this section is a variation on the ergodic decomposition
that will be useful. To describe the result, we need to digress briefly to introduce
a metric on spaces of probability measures. A thorough development can be
found in Chapter 8 of [50]. We have a standard sequence measurable space
$(\Omega, \mathcal{B})$ and hence we can generate the $\sigma$-field $\mathcal{B}$ by a countable field $\mathcal{F} = \{F_n;
n = 1, 2, \ldots\}$. Given such a countable generating field, a distributional distance
between two probability measures $p$ and $m$ on $(\Omega, \mathcal{B})$ is defined by

$$d(p, m) = \sum_{n=1}^{\infty} 2^{-n} |p(F_n) - m(F_n)|.$$

Any choice of a countable generating field yields a distributional distance. Such
a distance or metric yields a measurable space of probability measures as follows:
Let $\Lambda$ denote the space of all probability measures on the original measurable
space $(\Omega, \mathcal{B})$. Let $\mathcal{B}(\Lambda)$ denote the $\sigma$-field of subsets of $\Lambda$ generated by all
open spheres using the distributional distance, that is, all sets of the form $\{p :
d(p, m) < \epsilon\}$ for some $m \in \Lambda$ and some $\epsilon > 0$. We can now consider properties of
functions that carry sequences in our original space into probability measures.
The following is Theorem 8.5.1 of [50].

Theorem 1.8.3: Fix a standard measurable space $(\Omega, \mathcal{B})$ and a
transformation $T : \Omega \to \Omega$. Then there are a standard measurable space $(\Lambda, \mathcal{L})$, a family of
stationary ergodic measures $\{m_\lambda; \lambda \in \Lambda\}$ on $(\Omega, \mathcal{B})$, and a measurable mapping
$\psi : \Omega \to \Lambda$ such that

(a) $\psi$ is invariant ($\psi(Tx) = \psi(x)$ all $x$);

(b) if $m$ is a stationary measure on $(\Omega, \mathcal{B})$ and $P_\psi$ is the induced distribution;
that is, $P_\psi(G) = m(\psi^{-1}(G))$ for $G \in \mathcal{L}$ (which is well defined from (a)),
then

$$m(F) = \int dm(x)\, m_{\psi(x)}(F) = \int dP_\psi(\lambda)\, m_\lambda(F), \quad \text{all } F \in \mathcal{B},$$

and if $f \in L^1(m)$, then so is $\int f\,dm_\lambda$ $P_\psi$-a.e. and

$$E_m f = \int dm(x)\, E_{m_{\psi(x)}} f = \int dP_\psi(\lambda)\, E_{m_\lambda} f.$$

Finally, for any event $F$, $m_\psi(F) = m(F|\psi)$, that is, given the ergodic
decomposition and a stationary measure $m$, the ergodic component $\lambda$ is
a version of the conditional probability under $m$ given $\psi = \lambda$.

The following corollary to the ergodic decomposition is Lemma 8.6.2 of [50]. 
It states that the conditional probability of a future event given the entire past 
is unchanged by knowing the ergodic component in effect. This is because the 
infinite past determines the ergodic component in effect. 

Corollary 1.8.1: Suppose that $\{X_n\}$ is a two-sided stationary process with
distribution $m$ and that $\{m_\lambda; \lambda \in \Lambda\}$ is the ergodic decomposition and $\psi$ the
ergodic component function. Then the mapping $\psi$ is measurable with respect
to $\sigma(X_{-1}, X_{-2}, \ldots)$ and

$$m((X_0, X_1, \ldots) \in F | X_{-1}, X_{-2}, \ldots) = m_\psi((X_0, X_1, \ldots) \in F | X_{-1}, X_{-2}, \ldots); \quad m\text{-a.e.}$$




Chapter 2 



Entropy and Information 



2.1 Introduction 

The development of the idea of entropy of random variables and processes by 
Claude Shannon provided the beginnings of information theory and of the mod- 
ern age of ergodic theory. We shall see that entropy and related information 
measures provide useful descriptions of the long term behavior of random pro- 
cesses and that this behavior is a key factor in developing the coding theorems 
of information theory. We now introduce the various notions of entropy for ran- 
dom variables, vectors, processes, and dynamical systems and we develop many 
of the fundamental properties of entropy. 

In this chapter we emphasize the case of finite alphabet random processes 
for simplicity, reflecting the historical development of the subject. Occasionally 
we consider more general cases when it will ease later developments. 



2.2 Entropy and Entropy Rate 

There are several ways to introduce the notion of entropy and entropy rate. 
We take some care at the beginning in order to avoid redefining things later. 
We also try to use definitions resembling the usual definitions of elementary 
information theory where possible. Let $(\Omega, \mathcal{B}, P, T)$ be a dynamical system.
Let $f$ be a finite alphabet measurement (a simple function) defined on $\Omega$ and
define the one-sided random process $f_n = fT^n$; $n = 0, 1, 2, \ldots$. This process
can be viewed as a coding of the original space, that is, one produces successive
coded values by transforming (e.g., shifting) the points of the space, each time
producing an output symbol using the same rule or mapping. In the usual
way we can construct an equivalent directly given model of this process. Let
$A = \{a_1, a_2, \ldots, a_{\|A\|}\}$ denote the finite alphabet of $f$ and let $(A^{Z_+}, \mathcal{B}_A^{Z_+})$ be the
resulting one-sided sequence space, where $\mathcal{B}_A$ is the power set. We abbreviate
the notation for this sequence space to $(A^\infty, \mathcal{B}_A^\infty)$. Let $T_A$ denote the shift
on this space and let $X$ denote the time zero sampling or coordinate function
and define $X_n(x) = X(T_A^n x) = x_n$. Let $m$ denote the process distribution
induced by the original space and the $fT^n$, i.e., $m = P_{\bar{f}} = P\bar{f}^{-1}$ where $\bar{f}(\omega) =
(f(\omega), f(T\omega), f(T^2\omega), \ldots)$.

Observe that by construction, shifting the input point yields an output
sequence that is also shifted, that is,

$$\bar{f}(T\omega) = T_A \bar{f}(\omega).$$

Sequence-valued measurements of this form are called stationary or invariant 
codings (or time invariant or shift invariant codings in the case of the shift) 
since the coding commutes with the transformations. 

The entropy and entropy rates of a finite alphabet measurement depend 
only on the process distributions and hence are usually more easily stated in 
terms of the induced directly given model and the process distribution. For the 
moment, however, we point out that the definition can be stated in terms of 
either system. Later we will see that the entropy of the underlying system is 
defined as a supremum of the entropy rates of all finite alphabet codings of the 
system. 

The entropy of a discrete alphabet random variable $f$ defined on the
probability space $(\Omega, \mathcal{B}, P)$ is defined by

$$H_P(f) = -\sum_{a \in A} P(f = a) \ln P(f = a).$$ (2.1)

We define $0 \ln 0$ to be 0 in the above formula. We shall often use logarithms
to the base 2 instead of natural logarithms. The units for entropy are “nats”
when the natural logarithm is used and “bits” for base 2 logarithms. The
natural logarithms are usually more convenient for mathematics while the base 2
logarithms provide more intuitive descriptions. The subscript $P$ can be omitted
if the measure is clear from context. Be forewarned that the measure will
often not be clear from context since more than one measure may be under
consideration and hence the subscripts will be required. A discrete alphabet
random variable $f$ has a probability mass function (pmf), say $p_f$, defined by
$p_f(a) = P(f = a) = P(\{\omega : f(\omega) = a\})$ and hence we can also write

$$H(f) = -\sum_{a \in A} p_f(a) \ln p_f(a).$$

It is often convenient to consider the entropy not as a function of the
particular outputs of $f$ but as a function of the partition that $f$ induces on $\Omega$. In
particular, suppose that the alphabet of $f$ is $A = \{a_1, a_2, \ldots, a_{\|A\|}\}$ and define
the partition $\mathcal{Q} = \{Q_i; i = 1, 2, \ldots, \|A\|\}$ by $Q_i = \{\omega : f(\omega) = a_i\} = f^{-1}(\{a_i\})$.
In other words, $\mathcal{Q}$ consists of disjoint sets which group the points in $\Omega$ together
according to what output the measurement $f$ produces. We can consider the
entropy as a function of the partition and write

$$H_P(\mathcal{Q}) = -\sum_{i=1}^{\|A\|} P(Q_i) \ln P(Q_i).$$ (2.2)

Clearly different mappings with different alphabets can have the same entropy
if they induce the same partition. Both notations will be used according to the
desired emphasis. We have not yet defined entropy for random variables that
do not have discrete alphabets; we shall do that later.
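
The definitions (2.1) and (2.2) translate directly into a few lines of code. The Python sketch below computes the pmf induced by a measurement on a finite probability space and its entropy in nats or bits, using the convention $0 \ln 0 = 0$. The function names and the four-point example space are illustrative assumptions.

```python
import math

def induced_pmf(P, f):
    """pmf of the measurement f on a finite space with point masses P."""
    out = {}
    for w, p in P.items():
        out[f(w)] = out.get(f(w), 0.0) + p
    return out

def entropy(pmf, base=math.e):
    """-sum p log p with 0 log 0 taken as 0; base e gives nats, base 2 bits."""
    return -sum(p * math.log(p, base) for p in pmf.values() if p > 0)

# Hypothetical space: four equally likely points.
P = {w: 0.25 for w in range(4)}
f = lambda w: w % 2                              # alphabet {0, 1}
g = lambda w: "even" if w % 2 == 0 else "odd"    # different alphabet, same partition

h_nats = entropy(induced_pmf(P, f))
h_bits = entropy(induced_pmf(P, f), base=2)
```

The measurements `f` and `g` use different alphabets but induce the same partition of the space, so `entropy(induced_pmf(P, g))` coincides with `h_nats`, in line with the remark that entropy depends only on the induced partition.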

Return to the notation emphasizing the mapping $f$ rather than the partition.
Defining the random variable $P(f)$ by $P(f)(\omega) = P(\lambda : f(\lambda) = f(\omega))$ we can
also write the entropy as

$$H_P(f) = E_P(-\ln P(f)).$$

Using the equivalent directly given model we have immediately that

$$H_P(f) = H_P(\mathcal{Q}) = H_m(X_0) = E_m(-\ln m(X_0)).$$ (2.3)

At this point one might ask why we are carrying the baggage of notations 
for entropy in both the original space and in the sequence space. If we were 
dealing with only one measurement $f$ (or $X_n$), we could confine interest to the
simpler directly-given form. More generally, however, we will be interested in 
different measurements or codings on a common system. In this case we will 
require the notation using the original system. Hence for the moment we keep 
both forms, but we shall often focus on the second where possible and the first 
only when necessary. 

The $n$th order entropy of a discrete alphabet measurement $f$ with respect to
$T$ is defined as

$$H_P^{(n)}(f) = n^{-1} H_P(f^n)$$

where $f^n = (f, fT, fT^2, \ldots, fT^{n-1})$ or, equivalently, if we define the discrete
alphabet random process $X_n(\omega) = f(T^n \omega)$, then

$$f^n = X^n = (X_0, X_1, \ldots, X_{n-1}).$$

As previously, this is given by

$$H_m^{(n)}(X) = n^{-1} H_m(X^n) = n^{-1} E_m(-\ln m(X^n)).$$

This is also called the entropy (per-coordinate or per-sample) of the random
vector $f^n$ or $X^n$. We can also use the partition notation here. The partition
corresponding to $f^n$ has a particular form: Suppose that we have two partitions,
$\mathcal{Q} = \{Q_i\}$ and $\mathcal{P} = \{P_i\}$. Define their join $\mathcal{Q} \vee \mathcal{P}$ as the partition containing
all nonempty intersection sets of the form $Q_i \cap P_j$. Define also $T^{-1}\mathcal{Q}$ as the
partition containing the atoms $T^{-1}Q_i$. Then $f^n$ induces the partition

$$\bigvee_{i=0}^{n-1} T^{-i}\mathcal{Q}$$

and we can write

$$H_P^{(n)}(f) = H_P^{(n)}(\mathcal{Q}) = n^{-1} H_P\left( \bigvee_{i=0}^{n-1} T^{-i}\mathcal{Q} \right).$$

As before, which notation is preferable depends on whether we wish to emphasize
the mapping $f$ or the partition $\mathcal{Q}$.

The entropy rate or mean entropy of a discrete alphabet measurement $f$ with
respect to the transformation $T$ is defined by

$$\bar{H}_P(f) = \limsup_{n \to \infty} H_P^{(n)}(f) = \bar{H}_P(\mathcal{Q}) = \limsup_{n \to \infty} H_P^{(n)}(\mathcal{Q}) = \bar{H}_m(X) = \limsup_{n \to \infty} H_m^{(n)}(X).$$
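
To make the $n$th order entropy concrete, the Python sketch below computes $n^{-1} H_m(X^n)$ by brute-force enumeration of all $n$-tuples for a two-state Markov source started in its stationary distribution. The transition matrix, the initial pmf, and the function names are illustrative assumptions, and the enumeration is only practical for small $n$.

```python
import itertools
import math

def block_pmf(init, trans, n):
    """pmf of (X_0, ..., X_{n-1}) for a Markov chain with initial pmf `init`
    and transition matrix `trans`, found by enumerating all n-tuples."""
    states = range(len(init))
    pmf = {}
    for xs in itertools.product(states, repeat=n):
        p = init[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p *= trans[a][b]
        pmf[xs] = p
    return pmf

def nth_order_entropy(init, trans, n):
    """n^{-1} H(X^n) in nats."""
    pmf = block_pmf(init, trans, n)
    return -sum(p * math.log(p) for p in pmf.values() if p > 0) / n

# Hypothetical stationary two-state Markov source.
trans = [[0.9, 0.1], [0.2, 0.8]]
init = [2 / 3, 1 / 3]              # stationary pmf of `trans`
values = [nth_order_entropy(init, trans, n) for n in range(1, 11)]
```

For this stationary example the entries of `values` decrease with $n$, starting from the entropy of a single sample and moving toward the average conditional entropy of the next symbol given the current one, which is the limit appearing in the definition of the entropy rate.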

Given a dynamical system $(\Omega, \mathcal{B}, P, T)$, the entropy $H(P, T)$ of the system
(or of the measure with respect to the transformation) is defined by

$$H(P, T) = \sup_f \bar{H}_P(f) = \sup_{\mathcal{Q}} \bar{H}_P(\mathcal{Q}),$$

where the supremum is over all finite alphabet measurements (or codings) or,
equivalently, over all finite measurable partitions of $\Omega$. (We emphasize that
this means alphabets of size $M$ for all finite values of $M$.) The entropy of a
system is also called the Kolmogorov-Sinai invariant of the system because of
the generalization by Kolmogorov [88] and Sinai [134] of Shannon’s entropy rate
concept to dynamical systems and the demonstration that equal entropy was a
necessary condition for two dynamical systems to be isomorphic.

If we have a dynamical system corresponding to a finite alphabet
random process $\{X_n\}$, then one possible finite alphabet measurement on the
process is $f(x) = x_0$, that is, the time 0 output. In this case clearly $\bar{H}_P(f) =
\bar{H}_P(X)$ and hence, since the system entropy is defined as the supremum over
all simple measurements,

$$H(P, T) \ge \bar{H}_P(X).$$ (2.4)

We shall later see that (2.4) holds with equality for finite alphabet random
processes and provides a generalization of entropy rate for processes that do not
have finite alphabets.

2.3 Basic Properties of Entropy 

For simplicity we focus on the entropy rate of a directly given finite alphabet
random process $\{X_n\}$. We also will emphasize stationary measures, but we will
try to clarify those results that require stationarity and those that are more
general.

Let $A$ be a finite set. Let $\Omega = A^{Z_+}$ and let $\mathcal{B}$ be the sigma-field of subsets of
$\Omega$ generated by the rectangles. Since $A$ is finite, $(A, \mathcal{B}_A)$ is standard, where $\mathcal{B}_A$ is
the power set of $A$. Thus $(\Omega, \mathcal{B})$ is also standard by Lemma 2.4.1 of [50]. In fact,
from the proof that cartesian products of standard spaces are standard, we can
take as a basis for $\mathcal{B}$ the fields $\mathcal{F}_n$ generated by the finite dimensional rectangles
having the form $\{x : X^n(x) = x^n = a^n\}$ for all $a^n \in A^n$ and all positive integers
$n$. (Members of this class of rectangles are called thin cylinders.) The union of
all such fields, say $\mathcal{F}$, is then a generating field.

Many of the basic properties of entropy follow from the following simple
inequality.

Lemma 2.3.1: Given two probability mass functions $\{p_i\}$ and $\{q_i\}$, that
is, two countable or finite sequences of nonnegative numbers that sum to one,
then

$$\sum_i p_i \ln \frac{p_i}{q_i} \ge 0$$

with equality if and only if $q_i = p_i$, all $i$.

Proof: The lemma follows easily from the elementary inequality for real
numbers

$$\ln x \le x - 1$$ (2.5)

(with equality if and only if $x = 1$) since

$$\sum_i p_i \ln \frac{q_i}{p_i} \le \sum_i p_i \left( \frac{q_i}{p_i} - 1 \right) = \sum_i q_i - \sum_i p_i = 0$$

with equality if and only if $q_i/p_i = 1$ all $i$. Alternatively, the inequality follows
from Jensen’s inequality [63] since $\ln$ is a convex $\cap$ function:

$$\sum_i p_i \ln \frac{q_i}{p_i} \le \ln \left( \sum_i p_i \frac{q_i}{p_i} \right) = 0$$

with equality if and only if $q_i/p_i = 1$, all $i$. □

The quantity used in the lemma is of such fundamental importance that we
pause to introduce another notion of information and to recast the inequality
in terms of it. As with entropy, the definition for the moment is only for finite
alphabet random variables. Also as with entropy, there are a variety of ways
to define it. Suppose that we have an underlying measurable space $(\Omega, \mathcal{B})$ and
two measures on this space, say $P$ and $M$, and we have a random variable $f$
with finite alphabet $A$ defined on the space and that $\mathcal{Q}$ is the induced partition
$\{f^{-1}(a); a \in A\}$. Let $P_f$ and $M_f$ be the induced distributions and let $p$ and $m$ be
the corresponding probability mass functions, e.g., $p(a) = P_f(\{a\}) = P(f = a)$.
Define the relative entropy of a measurement $f$ with measure $P$ with respect to
the measure $M$ by

$$H_{P\|M}(f) = H_{P\|M}(\mathcal{Q}) = \sum_{a \in A} p(a) \ln \frac{p(a)}{m(a)} = \sum_{i=1}^{\|A\|} P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)}.$$



Observe that this only makes sense if $p(a)$ is 0 whenever $m(a)$ is, that is, if $P_f$ is
absolutely continuous with respect to $M_f$ or $M_f \gg P_f$. Define $H_{P\|M}(f) = \infty$
if $P_f$ is not absolutely continuous with respect to $M_f$. The measure $M$ is
referred to as the reference measure. Relative entropies will play an increasingly
important role as general alphabets are considered. In the early chapters the
emphasis will be on ordinary entropy with similar properties for relative
entropies following almost as an afterthought. When considering more abstract
(nonfinite) alphabets later on, relative entropies will prove indispensable.

Analogous to entropy, given a random process $\{X_n\}$ described by two process
distributions $p$ and $m$, if it is true that

$$m_{X^n} \gg p_{X^n}; \quad n = 1, 2, \ldots,$$

then we can define for each $n$ the $n$th order relative entropy $n^{-1} H_{p\|m}(X^n)$ and
the relative entropy rate

$$\bar{H}_{p\|m}(X) = \limsup_{n \to \infty} \frac{1}{n} H_{p\|m}(X^n).$$

When dealing with relative entropies it is often the measures that are
important and not the random variable or partition. We introduce a special notation
which emphasizes this fact. Given a probability space $(\Omega, \mathcal{B}, P)$, with $\Omega$ a finite
space, and another measure $M$ on the same space, we define the divergence of
$P$ with respect to $M$ as the relative entropy of the identity mapping with respect
to the two measures:

$$D(P\|M) = \sum_{\omega \in \Omega} P(\omega) \ln \frac{P(\omega)}{M(\omega)}.$$

Thus, for example, given a finite alphabet measurement $f$ on an arbitrary
probability space $(\Omega, \mathcal{B}, P)$, if $M$ is another measure on $(\Omega, \mathcal{B})$ then

$$H_{P\|M}(f) = D(P_f\|M_f).$$

Similarly,

$$H_{p\|m}(X^n) = D(P_{X^n}\|M_{X^n}),$$

where $P_{X^n}$ and $M_{X^n}$ are the distributions for $X^n$ induced by process measures $p$
and $m$, respectively. The theory and properties of relative entropy are therefore
determined by those for divergence.

There are many names and notations for relative entropy and divergence 
throughout the literature. The idea was introduced by Kullback for applications 
of information theory to statistics (see, e.g., Kullback [92] and the references 
therein) and was used to develop information theoretic results by Perez [120] 
[122] [121], Dobrushin [32], and Pinsker [125]. Various names in common use for 
this quantity are discrimination, discrimination information, Kullback-Leibler 
number, directed divergence, and cross entropy. 

The lemma can be summarized simply in terms of divergence as in the 
following theorem, which is commonly referred to as the divergence inequality. 

Theorem 2.3.1: Given any two probability measures $P$ and $M$ on a
common finite alphabet probability space, then

$$D(P\|M) \ge 0$$ (2.6)

with equality if and only if $P = M$.

In this form the result is known as the divergence inequality. The fact that
the divergence of one probability measure with respect to another is nonnegative
and zero only when the two measures are the same suggests the interpretation
of divergence as a “distance” between the two probability measures, that is, a 
measure of how different the two measures are. It is not a true distance or metric 
in the usual sense since it is not a symmetric function of the two measures and 
it does not satisfy the triangle inequality. The interpretation is, however, quite 
useful for adding insight into results characterizing the behavior of divergence 
and it will later be seen to have implications for ordinary distance measures 
between probability measures. 
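
The divergence of two pmfs on a finite alphabet is straightforward to compute. The Python sketch below follows the conventions stated above: terms with $p(a) = 0$ contribute nothing, and the divergence is infinite when $P$ is not absolutely continuous with respect to $M$. The function name and the two example pmfs are illustrative assumptions.

```python
import math

def divergence(p, m):
    """D(P||M) = sum_a p(a) ln(p(a)/m(a)) in nats for pmfs given as dicts.
    Terms with p(a) = 0 contribute 0; the result is +inf if P is not
    absolutely continuous with respect to M."""
    total = 0.0
    for a, pa in p.items():
        if pa == 0:
            continue
        ma = m.get(a, 0.0)
        if ma == 0:
            return math.inf
        total += pa * math.log(pa / ma)
    return total

# Hypothetical pmfs on the alphabet {"a", "b"}.
p = {"a": 0.5, "b": 0.5}
m = {"a": 0.9, "b": 0.1}
d_pm = divergence(p, m)
d_mp = divergence(m, p)
```

For these pmfs both `d_pm` and `d_mp` are strictly positive but unequal, and `divergence(p, p)` is 0, which illustrates the divergence inequality (2.6) and the lack of symmetry noted above.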

The divergence plays a basic role in the family of information measures: all
of the information measures that we will encounter (entropy, relative entropy,
mutual information, and the conditional forms of these information measures)
can be expressed as a divergence.

There are three ways to view entropy as a special case of divergence. The
first is to permit $M$ to be a general measure instead of requiring it to be a
probability measure and have total mass 1. In this case entropy is minus the
divergence if $M$ is the counting measure, i.e., assigns measure 1 to every point
in the discrete alphabet. If $M$ is not a probability measure, then the divergence
inequality (2.6) need not hold. Second, if the alphabet of $f$ is $A_f$ and has $\|A_f\|$
elements, then letting $M$ be a uniform pmf assigning probability $1/\|A_f\|$ to all
symbols in $A_f$ yields

$$D(P\|M) = \ln \|A_f\| - H_P(f) \ge 0$$

and hence the entropy is the log of the alphabet size minus the divergence
with respect to the uniform distribution. Third, we can also consider entropy a
special case of divergence while still requiring that $M$ be a probability measure
by using product measures and a bit of a trick. Say we have two measures $P$ and
$Q$ on a common probability space $(\Omega, \mathcal{B})$. Define two measures on the product
space $(\Omega \times \Omega, \mathcal{B}(\Omega \times \Omega))$ as follows: Let $P \times Q$ denote the usual product measure,
that is, the measure specified by its values on rectangles as $P \times Q(F \times G) =
P(F)Q(G)$. Thus, for example, if $P$ and $Q$ are discrete distributions with pmf’s
$p$ and $q$, then the pmf for $P \times Q$ is just $p(a)q(b)$. Let $P'$ denote the “diagonal”
measure defined by its values on rectangles as $P'(F \times G) = P(F \cap G)$. In the
discrete case $P'$ has pmf $p'(a, b) = p(a)$ if $a = b$ and 0 otherwise. Then

$$H_P(f) = D(P'\|P \times P).$$

Note that if we let $X$ and $Y$ be the coordinate random variables on our product
space, then both $P'$ and $P \times P$ give the same marginal probabilities to $X$ and
$Y$, that is, $P_X = P_Y = P$. $P'$ is an extreme distribution on $(X, Y)$ in the sense
that with probability one $X = Y$; the two coordinates are deterministically
dependent on one another. $P \times P$, however, is the opposite extreme in that it
makes the two random variables $X$ and $Y$ independent of one another. Thus
the entropy of a distribution $P$ can be viewed as the relative entropy between
these two extreme joint distributions having marginals $P$.
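
The three views of entropy as a divergence can be checked numerically on a small alphabet. In the Python sketch below the pmf `p` and all names are illustrative assumptions, and `divergence` accepts any nonnegative reference measure, so that the counting measure of the first view can be used.

```python
import math

def entropy(p):
    """-sum p ln p in nats, with 0 ln 0 taken as 0."""
    return -sum(x * math.log(x) for x in p.values() if x > 0)

def divergence(p, m):
    """sum_a p(a) ln(p(a)/m(a)); m may be any nonnegative measure with
    m(a) > 0 wherever p(a) > 0 (assumed here, not checked)."""
    return sum(x * math.log(x / m[a]) for a, x in p.items() if x > 0)

# Hypothetical pmf on a three-letter alphabet.
p = {"a": 0.5, "b": 0.25, "c": 0.25}

counting = {a: 1.0 for a in p}                  # view 1: counting measure
uniform = {a: 1.0 / len(p) for a in p}          # view 2: uniform pmf
diagonal = {(a, a): x for a, x in p.items()}    # view 3: diagonal measure P'
product = {(a, b): x * y                        # view 3: product measure P x P
           for a, x in p.items() for b, y in p.items()}

view1 = -divergence(p, counting)
view2 = math.log(len(p)) - divergence(p, uniform)
view3 = divergence(diagonal, product)
```

Up to floating point rounding, `view1`, `view2`, and `view3` all agree with `entropy(p)`.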
We now return to the general development for entropy. For the moment fix
a probability measure $m$ on a measurable space $(\Omega, \mathcal{B})$ and let $X$ and $Y$ be two
finite alphabet random variables defined on that space. Let $A_X$ and $A_Y$ denote
the corresponding alphabets. Let $P_{XY}$, $P_X$, and $P_Y$ denote the distributions of
$(X, Y)$, $X$, and $Y$, respectively.

First observe that since $P_X(a) \le 1$, all $a$, $-\ln P_X(a)$ is nonnegative and hence

$$H(X) = -\sum_{a \in A_X} P_X(a) \ln P_X(a) \ge 0.$$ (2.7)



From (2.6) with $M$ uniform as in the second interpretation of entropy above,
if $X$ is a random variable with alphabet $A_X$, then

$$H(X) \le \ln \|A_X\|.$$



Since for any $a \in A_X$ and $b \in A_Y$ we have that $P_X(a) \ge P_{XY}(a, b)$, it follows
that

$$H(X, Y) = -\sum_{a,b} P_{XY}(a, b) \ln P_{XY}(a, b) \ge -\sum_{a,b} P_{XY}(a, b) \ln P_X(a) = H(X).$$

Using Lemma 2.3.1 we have that since $P_{XY}$ and $P_X P_Y$ are probability mass
functions,

$$H(X, Y) - (H(X) + H(Y)) = \sum_{a,b} P_{XY}(a, b) \ln \frac{P_X(a) P_Y(b)}{P_{XY}(a, b)} \le 0.$$



This proves the following result: 

Lemma 2.3.2: Given two discrete alphabet random variables $X$ and $Y$
defined on a common probability space, we have

$$0 \le H(X)$$ (2.8)

and

$$\max(H(X), H(Y)) \le H(X, Y) \le H(X) + H(Y)$$ (2.9)

where the right hand inequality holds with equality if and only if $X$ and $Y$ are
independent. If the alphabet of $X$ has $\|A_X\|$ symbols, then

$$H(X) \le \ln \|A_X\|.$$ (2.10)
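
The inequalities of Lemma 2.3.2 can be verified on any small joint pmf. The Python sketch below does so for a hypothetical joint distribution of two binary random variables; the joint pmf and the function names are illustrative assumptions.

```python
import math

def entropy(pmf):
    """-sum p ln p in nats, with 0 ln 0 taken as 0."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginals(joint):
    """Marginal pmfs of X and Y from a joint pmf keyed by (a, b)."""
    px, py = {}, {}
    for (a, b), p in joint.items():
        px[a] = px.get(a, 0.0) + p
        py[b] = py.get(b, 0.0) + p
    return px, py

# Hypothetical joint pmf of two dependent binary random variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px, py = marginals(joint)
hx, hy, hxy = entropy(px), entropy(py), entropy(joint)

# Joint pmf with the same marginals under which X and Y are independent.
independent = {(a, b): px[a] * py[b] for a in px for b in py}
```

Here `hxy` lies between `max(hx, hy)` and `hx + hy`, with strict inequality on the right because the two variables are dependent, while `entropy(independent)` equals `hx + hy` up to rounding.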

There is another proof of the left hand inequality in (2.9) that uses an
inequality for relative entropy that will be useful later when considering codes.
The following lemma gives the inequality. First we introduce a definition. A
partition $\mathcal{R}$ is said to refine a partition $\mathcal{Q}$ if every atom in $\mathcal{Q}$ is a union of atoms
of $\mathcal{R}$, in which case we write $\mathcal{Q} < \mathcal{R}$.

Lemma 2.3.3: Suppose that $P$ and $M$ are two measures defined on a
common measurable space and that we are given finite partitions $\mathcal{Q} < \mathcal{R}$. Then

$$H_{P\|M}(\mathcal{Q}) \le H_{P\|M}(\mathcal{R})$$

and

$$H_P(\mathcal{Q}) \le H_P(\mathcal{R}).$$

Comments: The lemma can also be stated in terms of random variables and
mappings in an intuitive way: Suppose that $U$ is a random variable with finite
alphabet $A$ and $f : A \to B$ is a mapping from $A$ into another finite alphabet
$B$. Then the composite random variable $f(U)$ defined by $f(U)(\omega) = f(U(\omega))$ is
also a finite random variable. If $U$ induces a partition $\mathcal{R}$ and $f(U)$ a partition
$\mathcal{Q}$, then $\mathcal{Q} < \mathcal{R}$ (since knowing the value of $U$ implies the value of $f(U)$). Thus
the lemma immediately gives the following corollary.

Corollary 2.3.1 If M » P are two measures describing a random variable 
U with alphabet A and if / : A — > B, then 



H P \\ M (f(U)) < H p \\ m (U) 

and 

H P {f{U)) < H P (U). 

Since D(Pf\\Mf) = £/pmm(/ ), we have also the following corollary which we 
state for future reference. 

Corollary 2.3.2: Suppose that P and M are two probability measures on 
a discrete space and that / is a random variable defined on that space, then 

D(Pf\\Mf) < D(P\\M). 

The lemma, discussion, and corollaries can all be interpreted as saying that 
taking a measurement on a finite alphabet random variable lowers the entropy 
and the relative entropy of that random variable. By choosing U as (X,Y) and 
f(X, Y) = X or Y, the lemma yields the promised inequality of the previous 
lemma. 
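The corollaries can be illustrated with a short numerical experiment. The sketch below is an illustration under stated assumptions rather than part of the text: the mapping `f` and the helper names `divergence`, `push_forward`, and `random_pmf` are hypothetical, and both measures are random strictly positive pmfs so that absolute continuity holds automatically.

```python
import math
import random

def divergence(p, m):
    """D(P||M) in nats for pmfs on the same finite alphabet (inf if P is not dominated by M)."""
    total = 0.0
    for pi, mi in zip(p, m):
        if pi > 0:
            if mi == 0:
                return math.inf
            total += pi * math.log(pi / mi)
    return total

def entropy(p):
    """Entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def push_forward(p, f, size_b):
    """pmf of f(U) when U has pmf p; f[a] is the image of letter a."""
    out = [0.0] * size_b
    for a, pa in enumerate(p):
        out[f[a]] += pa
    return out

def random_pmf(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(1)
f = [0, 0, 1, 2, 2, 2]   # hypothetical mapping from a 6-letter alphabet onto 3 letters
ok = True
for _ in range(1000):
    p, m = random_pmf(6, rng), random_pmf(6, rng)
    pf, mf = push_forward(p, f, 3), push_forward(m, f, 3)
    ok &= divergence(pf, mf) <= divergence(p, m) + 1e-12   # Corollary 2.3.2
    ok &= entropy(pf) <= entropy(p) + 1e-12                # second part of Corollary 2.3.1
```

Merging letters through `f` coarsens the partition induced by $U$, so neither the divergence nor the entropy can increase.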

Proof of Lemma: If $H_{P\|M}(\mathcal{R}) = +\infty$, the result is immediate. If $H_{P\|M}(\mathcal{Q}) = +\infty$,
that is, if there exists at least one $Q_j$ such that $M(Q_j) = 0$ but $P(Q_j) \ne 0$,
then there exists an $R_i \subset Q_j$ such that $M(R_i) = 0$ and $P(R_i) > 0$ and hence
$H_{P\|M}(\mathcal{R}) = +\infty$. Lastly assume that both $H_{P\|M}(\mathcal{R})$ and $H_{P\|M}(\mathcal{Q})$ are finite
and consider the difference

\begin{align*}
H_{P\|M}(\mathcal{R}) - H_{P\|M}(\mathcal{Q})
&= \sum_i P(R_i) \ln \frac{P(R_i)}{M(R_i)} - \sum_j P(Q_j) \ln \frac{P(Q_j)}{M(Q_j)} \\
&= \sum_j \left[ \sum_{i: R_i \subset Q_j} P(R_i) \ln \frac{P(R_i)}{M(R_i)} - P(Q_j) \ln \frac{P(Q_j)}{M(Q_j)} \right].
\end{align*}
We shall show that each of the bracketed terms is nonnegative, which will prove
the first inequality. Fix $j$. If $P(Q_j)$ is 0 we are done since then also $P(R_i)$ is 0
for all $i$ in the inner sum since these $R_i$ all belong to $Q_j$. If $P(Q_j)$ is not 0, we
can divide by it to rewrite the bracketed term as

\[ P(Q_j) \left( \sum_{i: R_i \subset Q_j} \frac{P(R_i)}{P(Q_j)} \ln \frac{P(R_i)/P(Q_j)}{M(R_i)/M(Q_j)} \right) \]

where we also used the fact that $M(Q_j)$ cannot be 0 since then $P(Q_j)$ would
also have to be zero. Since $R_i \subset Q_j$, $P(R_i)/P(Q_j) = P(R_i \cap Q_j)/P(Q_j) =
P(R_i|Q_j)$ is an elementary conditional probability. Applying a similar argument
to $M$ and dividing by $P(Q_j)$, the above expression becomes

\[ \sum_{i: R_i \subset Q_j} P(R_i|Q_j) \ln \frac{P(R_i|Q_j)}{M(R_i|Q_j)} \]

which is nonnegative from Lemma 2.3.1, which proves the first inequality. The
second inequality follows similarly: Consider the difference

\begin{align*}
H_P(\mathcal{R}) - H_P(\mathcal{Q})
&= \sum_j \left[ \sum_{i: R_i \subset Q_j} P(R_i) \ln \frac{P(Q_j)}{P(R_i)} \right] \\
&= \sum_j P(Q_j) \left[ -\sum_{i: R_i \subset Q_j} P(R_i|Q_j) \ln P(R_i|Q_j) \right]
\end{align*}

and the result follows since the bracketed term is nonnegative since it is an
entropy for each value of $j$ (Lemma 2.3.2). □

The next result provides useful inequalities for entropy considered as a
function of the underlying distribution. In particular, it shows that entropy is a
concave (or convex $\cap$) function of the underlying distribution. Define the
binary entropy function (the entropy of a binary random variable with probability
mass function $(\lambda, 1 - \lambda)$) by

\[ h_2(\lambda) = -\lambda \ln \lambda - (1 - \lambda) \ln(1 - \lambda). \]

Lemma 2.3.4: Let $m$ and $p$ denote two distributions for a discrete alphabet
random variable $X$. Then for any $\lambda \in (0, 1)$

\begin{align*}
\lambda H_m(X) + (1 - \lambda) H_p(X) &\le H_{\lambda m + (1-\lambda)p}(X) \\
&\le \lambda H_m(X) + (1 - \lambda) H_p(X) + h_2(\lambda). \tag{2.11}
\end{align*}
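As a sanity check on (2.11), the following sketch evaluates both bounds for randomly drawn distributions and for a pair with disjoint supports, where the upper bound is attained. The helper names and the random test data are assumptions made for illustration only.

```python
import math
import random

def entropy(p):
    """Entropy in nats."""
    return -sum(v * math.log(v) for v in p if v > 0)

def h2(lam):
    """Binary entropy function in nats, for 0 < lam < 1."""
    return -lam * math.log(lam) - (1 - lam) * math.log(1 - lam)

def mixture(m, p, lam):
    return [lam * a + (1 - lam) * b for a, b in zip(m, p)]

def random_pmf(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(2)
tol = 1e-12
bounds_hold = True
for _ in range(1000):
    m, p = random_pmf(5, rng), random_pmf(5, rng)
    lam = rng.uniform(0.01, 0.99)
    lower = lam * entropy(m) + (1 - lam) * entropy(p)
    bounds_hold &= lower - tol <= entropy(mixture(m, p, lam)) <= lower + h2(lam) + tol

# Disjoint supports: observing X reveals which component produced it, and the
# upper bound in (2.11) holds with equality.
m, p, lam = [0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5], 0.3
upper_gap = lam * entropy(m) + (1 - lam) * entropy(p) + h2(lam) - entropy(mixture(m, p, lam))
```

The disjoint-support case suggests reading the extra term $h_2(\lambda)$ as the entropy of the hidden choice between the two components.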

Proof: We do a little extra here to save work in a later result. Define the 
quantities 

\[ I = -\sum_x m(x) \ln(\lambda m(x) + (1 - \lambda) p(x)) \]
(2.12)



(2.13) 



The next result presents an interesting connection between combinatorics 
and binomial sums with a particular entropy. We require the familiar definition 
of the binomial coefficient: 

\[ \binom{n}{k} = \frac{n!}{k!(n-k)!}. \]

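Lemma 2.3.5: Given $\delta \in (0, \frac{1}{2}]$ and a positive integer $M$, we have

\[ \sum_{i \le \delta M} \binom{M}{i} \le e^{M h_2(\delta)}. \tag{2.14} \]

If $0 < \delta \le p \le 1$, then

\[ \sum_{i \le \delta M} \binom{M}{i} p^i (1-p)^{M-i} \le e^{-M h_2(\delta\|p)}, \tag{2.15} \]

where

\[ h_2(\delta\|p) = \delta \ln \frac{\delta}{p} + (1 - \delta) \ln \frac{1-\delta}{1-p}. \]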
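Both bounds can be checked directly for moderate $M$. The following sketch is illustrative only: the function names and the grid of values for $M$, $\delta$, and $p$ are assumptions, and $p$ is kept strictly below 1 so that $h_2(\delta\|p)$ is finite.

```python
import math

def h2(d):
    """Binary entropy in nats, 0 < d < 1."""
    return -d * math.log(d) - (1 - d) * math.log(1 - d)

def h2_rel(d, p):
    """Binary relative entropy h_2(d||p) in nats, 0 < d < 1 and 0 < p < 1."""
    return d * math.log(d / p) + (1 - d) * math.log((1 - d) / (1 - p))

def check_bounds(M, delta, p):
    """Return (holds_2_14, holds_2_15) for the given parameters."""
    k_max = math.floor(delta * M)                     # largest i with i <= delta*M
    count = sum(math.comb(M, i) for i in range(k_max + 1))
    tail = sum(math.comb(M, i) * p**i * (1 - p)**(M - i) for i in range(k_max + 1))
    slack = 1 + 1e-12                                 # guard against rounding
    return (count <= math.exp(M * h2(delta)) * slack,
            tail <= math.exp(-M * h2_rel(delta, p)) * slack)

results = [check_bounds(M, delta, p)
           for M in (10, 50, 200)
           for delta in (0.1, 0.25, 0.5)
           for p in (0.5, 0.7, 0.9)]
both_hold = all(a and b for a, b in results)
```

The second bound is the familiar exponential estimate for the lower tail of a binomial distribution with success probability $p$.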

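Proof: We have after some simple algebra that

\[ e^{-h_2(\delta) M} = \delta^{\delta M} (1 - \delta)^{(1-\delta)M}. \]

If $\delta \le 1/2$, then $\delta^k (1 - \delta)^{M-k}$ increases as $k$ decreases (since we are having more
large terms and fewer small terms in the product) and hence if $i \le M\delta$,

\[ \delta^{\delta M} (1 - \delta)^{(1-\delta)M} \le \delta^i (1 - \delta)^{M-i}. \]

Thus we have the inequalities

\begin{align*}
1 = \sum_{i=0}^{M} \binom{M}{i} \delta^i (1 - \delta)^{M-i}
&\ge \sum_{i \le \delta M} \binom{M}{i} \delta^i (1 - \delta)^{M-i} \\
&\ge e^{-h_2(\delta) M} \sum_{i \le \delta M} \binom{M}{i},
\end{align*}

which completes the proof of (2.14). In a similar fashion we have that

\[ e^{M h_2(\delta\|p)} = \left(\frac{\delta}{p}\right)^{\delta M} \left(\frac{1-\delta}{1-p}\right)^{(1-\delta)M}. \]

Since $\delta \le p$, we have as in the first argument that for $i \le M\delta$

\[ \left(\frac{\delta}{p}\right)^{\delta M} \left(\frac{1-\delta}{1-p}\right)^{(1-\delta)M} \le \left(\frac{\delta}{p}\right)^{i} \left(\frac{1-\delta}{1-p}\right)^{M-i} \]

and therefore after some algebra we have that if $i \le M\delta$ then

\[ p^i (1-p)^{M-i} \le \delta^i (1-\delta)^{M-i} e^{-M h_2(\delta\|p)} \]

and hence

\begin{align*}
\sum_{i \le \delta M} \binom{M}{i} p^i (1-p)^{M-i}
&\le e^{-M h_2(\delta\|p)} \sum_{i \le \delta M} \binom{M}{i} \delta^i (1-\delta)^{M-i} \\
&\le e^{-M h_2(\delta\|p)} \sum_{i=0}^{M} \binom{M}{i} \delta^i (1-\delta)^{M-i} = e^{-M h_2(\delta\|p)},
\end{align*}

which proves (2.15). □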

The following is a technical but useful property of sample entropies. The
proof follows Billingsley [15].

Lemma 2.3.6: Given a finite alphabet process $\{X_n\}$ (not necessarily
stationary) with distribution $m$, let $X_k^n = (X_k, X_{k+1}, \cdots, X_{k+n-1})$ denote the
random vectors giving a block of samples of dimension $n$ starting at time $k$.
Then the random variables $n^{-1} \ln m(X_k^n)$ are $m$-uniformly integrable (uniform
in $k$ and $n$).

Proof: For each nonnegative integer $r$ define the sets

\[ E_r(k,n) = \{x : -\frac{1}{n} \ln m(x_k^n) \in [r, r+1)\} \]

and hence if $x \in E_r(k,n)$ then

\[ r \le -\frac{1}{n} \ln m(x_k^n) < r + 1 \]

or

\[ e^{-nr} \ge m(x_k^n) > e^{-n(r+1)}. \]

Thus for any $r$

\begin{align*}
\int_{E_r(k,n)} \left(-\frac{1}{n} \ln m(X_k^n)\right) dm
&\le (r+1)\, m(E_r(k,n)) \\
&= (r+1) \sum_{x_k^n \in E_r(k,n)} m(x_k^n) \le (r+1) \sum_{x_k^n} e^{-nr} \\
&= (r+1) e^{-nr} \|A\|^n,
\end{align*}

where the final step follows since there are at most $\|A\|^n$ possible $n$-tuples
corresponding to thin cylinders in $E_r(k,n)$ and by construction each has probability
at most $e^{-nr}$.

To prove uniform integrability we must show uniform convergence to 0 as
$r \to \infty$ of the integral

\begin{align*}
\gamma_r(k,n) &= \int_{x: -\frac{1}{n} \ln m(x_k^n) \ge r} \left(-\frac{1}{n} \ln m(X_k^n)\right) dm \\
&= \sum_{i=0}^{\infty} \int_{E_{r+i}(k,n)} \left(-\frac{1}{n} \ln m(X_k^n)\right) dm \\
&\le \sum_{i=0}^{\infty} (r + i + 1) e^{-n(r+i)} \|A\|^n
= \sum_{i=0}^{\infty} (r + i + 1) e^{-n(r + i - \ln \|A\|)}.
\end{align*}

Taking $r$ large enough so that $r > \ln \|A\|$, then the exponential term is bounded
above by the special case $n = 1$ and we have the bound

\[ \gamma_r(k,n) \le \sum_{i=0}^{\infty} (r + i + 1) e^{-(r + i - \ln \|A\|)}, \]

a bound which is finite and independent of $k$ and $n$. The sum can easily be
shown to go to zero as $r \to \infty$ using standard summation formulas. (The
exponential terms shrink faster than the linear terms grow.) □

Variational Description of Divergence 

Divergence has a variational characterization that is a fundamental property 
for its applications to large deviations theory [143] [31]. Although this theory 
will not be treated here, the basic result of this section provides an alternative 
description of divergence and hence of relative entropy that has intrinsic interest. 
The basic result is originally due to Donsker and Varadhan [34]. 

Suppose now that $P$ and $M$ are two probability measures on a common
discrete probability space, say $(\Omega, \mathcal{B})$. Given any real-valued random variable $\Phi$
defined on the probability space, we will be interested in the quantity

\[ E_M e^{\Phi}, \tag{2.16} \]

which is called the cumulant generating function of $\Phi$ with respect to $M$ and
is related to the characteristic function of the random variable $\Phi$ as well as to
the moment generating function and the operational transform of the random
variable. The following theorem provides a variational description of divergence
in terms of the cumulant generating function.

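Theorem 2.3.2:

\[ D(P\|M) = \sup_{\Phi} \left( E_P \Phi - \ln(E_M(e^{\Phi})) \right). \tag{2.17} \]

Proof: First consider the random variable $\Phi$ defined by

\[ \Phi(\omega) = \ln(P(\omega)/M(\omega)) \]

and observe that

\begin{align*}
E_P \Phi - \ln(E_M(e^{\Phi})) &= \sum_{\omega} P(\omega) \ln \frac{P(\omega)}{M(\omega)} - \ln\left(\sum_{\omega} M(\omega) \frac{P(\omega)}{M(\omega)}\right) \\
&= D(P\|M) - \ln 1 = D(P\|M).
\end{align*}

This proves that the supremum over all $\Phi$ is no smaller than the divergence.

To prove the other half observe that for any bounded random variable $\Phi$,

\[ E_P \Phi - \ln E_M(e^{\Phi}) = E_P\left( \ln \frac{e^{\Phi}}{E_M(e^{\Phi})} \right) = \sum_{\omega} P(\omega) \left( \ln \frac{M^{\Phi}(\omega)}{M(\omega)} \right), \]

where the probability measure $M^{\Phi}$ is defined by

\[ M^{\Phi}(\omega) = \frac{M(\omega) e^{\Phi(\omega)}}{\sum_x M(x) e^{\Phi(x)}}. \]

We now have for any $\Phi$ that

\begin{align*}
D(P\|M) - \left( E_P \Phi - \ln(E_M(e^{\Phi})) \right)
&= \sum_{\omega} P(\omega) \left( \ln \frac{P(\omega)}{M(\omega)} \right) - \sum_{\omega} P(\omega) \left( \ln \frac{M^{\Phi}(\omega)}{M(\omega)} \right) \\
&= \sum_{\omega} P(\omega) \ln \frac{P(\omega)}{M^{\Phi}(\omega)} \ge 0
\end{align*}

using the divergence inequality. Since this is true for any $\Phi$, it is also true for
the supremum over $\Phi$ and the theorem is proved. □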
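The variational formula (2.17) lends itself to a direct numerical illustration. In the sketch below, which is an illustration and not part of the proof, the pmfs `p` and `m` are arbitrary example values, `dv_objective` is a hypothetical name for the bracketed quantity in (2.17), and the random choices of $\Phi$ only probe the supremum rather than compute it.

```python
import math
import random

def divergence(p, m):
    """D(P||M) in nats; assumes m > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, m) if a > 0)

def dv_objective(phi, p, m):
    """E_P[Phi] - ln E_M[exp(Phi)] for a function Phi given as a list of values."""
    mean_p = sum(a * f for a, f in zip(p, phi))
    log_mgf = math.log(sum(b * math.exp(f) for b, f in zip(m, phi)))
    return mean_p - log_mgf

p = [0.5, 0.3, 0.2]      # example pmf P
m = [0.2, 0.2, 0.6]      # example pmf M
d = divergence(p, m)

# The log-likelihood ratio attains the supremum.
phi_star = [math.log(a / b) for a, b in zip(p, m)]
best = dv_objective(phi_star, p, m)

# No other Phi should do better.
rng = random.Random(3)
never_exceeds = all(
    dv_objective([rng.uniform(-5, 5) for _ in p], p, m) <= d + 1e-12
    for _ in range(10000)
)
```

Adding a constant to $\Phi$ leaves the objective unchanged, so the maximizer is unique only up to an additive constant.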



2.4 Entropy Rate 

Again let $\{X_n; n = 0, 1, \cdots\}$ denote a finite alphabet random process and apply
Lemma 2.3.2 to vectors and obtain

\[ H(X_0, X_1, \cdots, X_{n-1}) \le H(X_0, X_1, \cdots, X_{m-1}) + H(X_m, X_{m+1}, \cdots, X_{n-1}); \quad 0 < m < n. \tag{2.18} \]

Define as usual the random vectors $X_k^n = (X_k, X_{k+1}, \cdots, X_{k+n-1})$, that
is, $X_k^n$ is a vector of dimension $n$ consisting of the samples of $X$ from $k$ to
$k + n - 1$. If the underlying measure is stationary, then the distributions of
the random vectors $X_k^n$ do not depend on $k$. Hence if we define the sequence
$h(n) = H(X^n) = H(X_0, \cdots, X_{n-1})$, then the above equation becomes

\[ h(k + n) \le h(k) + h(n); \quad \text{all } k, n > 0. \]

Thus $h(n)$ is a subadditive sequence as treated in Section 7.5 of [50]. A basic
property of subadditive sequences is that the limit $h(n)/n$ as $n \to \infty$ exists and
equals the infimum of $h(n)/n$ over $n$. (See, e.g., Lemma 7.5.1 of [50].) This
immediately yields the following result.

Lemma 2.4.1: If the distribution $m$ of a finite alphabet random process
$\{X_n\}$ is stationary, then

\[ \bar{H}_m(X) = \lim_{n \to \infty} \frac{1}{n} H_m(X^n) = \inf_{n \ge 1} \frac{1}{n} H_m(X^n). \]

Thus the limit exists and equals the infimum.
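The subadditivity of $h(n)$ and the behavior of $h(n)/n$ can be observed on a small example. The sketch below assumes a two-state stationary Markov chain with a hypothetical transition matrix `T` and its stationary distribution `pi`; block entropies are computed by brute-force enumeration, so only small $n$ are practical.

```python
import itertools
import math

T = [[0.9, 0.1], [0.4, 0.6]]   # hypothetical transition matrix
pi = [0.8, 0.2]                # its stationary distribution (pi T = pi)

def block_entropy(n):
    """h(n) = H(X_0, ..., X_{n-1}) in nats, by enumerating all binary n-tuples."""
    total = 0.0
    for x in itertools.product(range(2), repeat=n):
        prob = pi[x[0]]
        for a, b in zip(x, x[1:]):
            prob *= T[a][b]
        total -= prob * math.log(prob)
    return total

h = {n: block_entropy(n) for n in range(1, 13)}

# h(k + n) <= h(k) + h(n) for a stationary process.
subadditive = all(h[k + n] <= h[k] + h[n] + 1e-12
                  for k in range(1, 7) for n in range(1, 7))

# h(n)/n for n = 1, ..., 12; for this chain the sequence decreases toward its infimum.
rates = [h[n] / n for n in range(1, 13)]
```

For this chain `rates` decreases toward the conditional entropy of one sample given its predecessor, which is consistent with the Markov result obtained later in the chapter.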



The next two properties of entropy rate are primarily of interest because
they imply a third property, the ergodic decomposition of entropy rate, which
will be described in Theorem 2.4.1. They are also of some independent interest.
The first result is a continuity result for entropy rate when considered as a
function or functional on the underlying process distribution. The second property
demonstrates that entropy rate is actually an affine functional (both convex $\cup$
and convex $\cap$) of the underlying distribution, even though finite order entropy
was only convex $\cap$ and not affine.

We apply the distributional distance described in Section 1.8 to the standard
sequence measurable space $(\Omega, \mathcal{B}) = (A^{Z_+}, \mathcal{B}_A^{Z_+})$ with a $\sigma$-field generated by the
countable field $\mathcal{F} = \{F_n; n = 1, 2, \cdots\}$ generated by all thin rectangles.

Corollary 2.4.1: The entropy rate $\bar{H}_m(X)$ of a discrete alphabet random
process considered as a functional of stationary measures is upper semicontinuous;
that is, if probability measures $m$ and $m_n$, $n = 1, 2, \cdots$ have the property
that $d(m, m_n) \to 0$ as $n \to \infty$, then

\[ \bar{H}_m(X) \ge \limsup_{n \to \infty} \bar{H}_{m_n}(X). \]

Proof: For each fixed $n$

\[ H_m(X^n) = -\sum_{a^n \in A^n} m(X^n = a^n) \ln m(X^n = a^n) \]

is a continuous function of $m$ since for the distance to go to zero, the differences
between the probabilities of all thin rectangles must go to zero and the entropy is
the sum of continuous real-valued functions of the probabilities of thin rectangles.
Thus we have from Lemma 2.4.1 that if $d(m_k, m) \to 0$, then

\begin{align*}
\bar{H}_m(X) &= \inf_n \frac{1}{n} H_m(X^n) = \inf_n \frac{1}{n} \lim_{k \to \infty} H_{m_k}(X^n) \\
&\ge \limsup_{k \to \infty} \left( \inf_n \frac{1}{n} H_{m_k}(X^n) \right) = \limsup_{k \to \infty} \bar{H}_{m_k}(X). \quad □
\end{align*}

The next lemma uses Lemma 2.3.4 to show that entropy rates are affine 
functions of the underlying probability measures. 

Lemma 2.4.2: Let $m$ and $p$ denote two distributions for a discrete alphabet
random process $\{X_n\}$. Then for any $\lambda \in (0, 1)$,

\begin{align*}
\lambda H_m(X^n) + (1 - \lambda) H_p(X^n) &\le H_{\lambda m + (1-\lambda)p}(X^n) \\
&\le \lambda H_m(X^n) + (1 - \lambda) H_p(X^n) + h_2(\lambda), \tag{2.19}
\end{align*}

and

\begin{align*}
&\limsup_{n \to \infty} \left( -\int dm(x) \frac{1}{n} \ln(\lambda m(X^n(x)) + (1 - \lambda) p(X^n(x))) \right) \\
&\quad = \limsup_{n \to \infty} -\int dm(x) \frac{1}{n} \ln m(X^n(x)) = \bar{H}_m(X). \tag{2.20}
\end{align*}

If $m$ and $p$ are stationary then

\[ \bar{H}_{\lambda m + (1-\lambda)p}(X) = \lambda \bar{H}_m(X) + (1 - \lambda) \bar{H}_p(X) \tag{2.21} \]

and hence the entropy rate of a stationary discrete alphabet random process is
an affine function of the process distribution. □

Comment: Eq. (2.19) is simply Lemma 2.3.4 applied to the random vectors
$X^n$ stated in terms of the process distributions. Eq. (2.20) states that if we
look at the limit of the normalized log of a mixture of a pair of measures when 
one of the measures governs the process, then the limit of the expectation does 
not depend on the other measure at all and is simply the entropy rate of the 
driving source. Thus in a sense the sequences produced by a measure are able 
to select the true measure from a mixture. 

Proof: Eq. (2.19) is just Lemma 2.3.4. Dividing by n and taking the limit 
as n — > oo proves that entropy rate is affine. Similarly, take the limit supremum 
in expressions (2.12) and (2.13) and the lemma is proved. □ 
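The affine property (2.21) can be seen concretely for a mixture of two memoryless binary sources. The following is a sketch under stated assumptions: the component sources are i.i.d. Bernoulli with hypothetical parameters `a` and `b`, and the block entropy of the mixture is computed exactly by grouping $n$-tuples according to their number of ones.

```python
import math

def h2(x):
    """Binary entropy in nats, 0 < x < 1."""
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def mixture_block_entropy(a, b, lam, n):
    """H(X^n) in nats under lam*m + (1-lam)*p, with m, p i.i.d. Bernoulli(a), Bernoulli(b)."""
    total = 0.0
    for k in range(n + 1):   # k ones; all such n-tuples have the same probability q
        q = lam * a**k * (1 - a)**(n - k) + (1 - lam) * b**k * (1 - b)**(n - k)
        total -= math.comb(n, k) * q * math.log(q)
    return total

a, b, lam = 0.1, 0.4, 0.3    # hypothetical source parameters and mixing weight
affine_rate = lam * h2(a) + (1 - lam) * h2(b)

# By (2.19) the gap H_mix(X^n) - n * affine_rate lies in [0, h2(lam)] for every n,
# so after dividing by n the mixture rate approaches the affine combination.
gaps = {n: mixture_block_entropy(a, b, lam, n) - n * affine_rate
        for n in (1, 5, 20, 100, 200)}
per_symbol_gap = {n: g / n for n, g in gaps.items()}
```

The bounded gap is the uncertainty about which component is in force; it is paid once per realization rather than once per symbol, which is why it disappears from the rate.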

We are now prepared to prove one of the fundamental properties of entropy 
rate, the fact that it has an ergodic decomposition formula similar to property 
(c) of Theorem 1.8.2 when it is considered as a functional on the underlying 
distribution. In other words, the entropy rate of a stationary source is given by 
an integral of the entropy rates of the stationary ergodic components. This is a 
far more complicated result than property (c) of the ordinary ergodic decompo- 
sition because the entropy rate depends on the distribution; it is not a simple 
function of the underlying sequence. The result is due to Jacobs [68]. 

Theorem 2.4.1: The Ergodic Decomposition of Entropy Rate. Let $(A^{Z_+}, \mathcal{B}(A)^{Z_+}, m, T)$
be a stationary dynamical system corresponding to a stationary finite alphabet
source $\{X_n\}$. Let $\{p_x\}$ denote the ergodic decomposition of $m$. If $\bar{H}_{p_x}(X)$ is
$m$-integrable, then

\[ \bar{H}_m(X) = \int dm(x) \bar{H}_{p_x}(X). \]

Proof: The theorem follows immediately from Corollary 2.4.1 and Lemma
2.4.2 and the ergodic decomposition of semi-continuous affine functionals as in
Theorem 8.9.1 of [50]. □

Relative Entropy Rate 

The properties of relative entropy rate are more difficult to demonstrate. In
particular, the obvious analog to (2.18) does not hold for relative entropy rate
without the requirement that the reference measure be memoryless, and hence
one cannot immediately infer that the relative entropy rate is given by a limit
for stationary sources. The following lemma provides a condition under which
the relative entropy rate is given by a limit. The condition, that the dominating
measure be a $k$th order (or $k$-step) Markov source, will occur repeatedly when
dealing with relative entropy rates. A source is $k$th order Markov or $k$-step
Markov (or simply Markov if $k$ is clear from context) if for any $n$ and any
$N \ge k$

\begin{align*}
&P(X_n = x_n | X_{n-1} = x_{n-1}, \cdots, X_{n-N} = x_{n-N}) \\
&\quad = P(X_n = x_n | X_{n-1} = x_{n-1}, \cdots, X_{n-k} = x_{n-k}),
\end{align*}

that is, conditional probabilities given the infinite past depend only on the most
recent $k$ symbols. A 0-step Markov source is a memoryless source. A Markov
source is said to have stationary transitions if the above conditional probabilities
do not depend on $n$, that is, if for any $n$

\begin{align*}
&P(X_n = x_n | X_{n-1} = x_{n-1}, \cdots, X_{n-N} = x_{n-N}) \\
&\quad = P(X_k = x_n | X_{k-1} = x_{n-1}, \cdots, X_0 = x_{n-k}).
\end{align*}

Lemma 2.4.3: If $p$ is a stationary process and $m$ is a $k$-step Markov process
with stationary transitions, then

\[ \bar{H}_{p\|m}(X) = \lim_{n \to \infty} \frac{1}{n} H_{p\|m}(X^n) = -\bar{H}_p(X) - E_p[\ln m(X_k|X^k)], \]

where $E_p[\ln m(X_k|X^k)]$ is an abbreviation for

\[ E_p[\ln m(X_k|X^k)] = \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln m_{X_k|X^k}(x_k|x^k). \]

Proof: If for any $n$ it is not true that $m_{X^n} \gg p_{X^n}$, then $H_{p\|m}(X^n) = \infty$ for
that and all larger $n$ and both sides of the formula are infinite, hence we assume
that all of the finite dimensional distributions satisfy the absolute continuity
relation. Since $m$ is Markov,

\[ m_{X^n}(x^n) = \prod_{l=k}^{n-1} m_{X_l|X^l}(x_l|x^l)\, m_{X^k}(x^k). \]

Thus

\begin{align*}
\frac{1}{n} H_{p\|m}(X^n) &= -\frac{1}{n} H_p(X^n) - \frac{1}{n} \sum_{x^n} p_{X^n}(x^n) \ln m_{X^n}(x^n) \\
&= -\frac{1}{n} H_p(X^n) - \frac{1}{n} \sum_{x^k} p_{X^k}(x^k) \ln m_{X^k}(x^k) \\
&\quad - \frac{n-k}{n} \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln m_{X_k|X^k}(x_k|x^k).
\end{align*}

Taking limits then yields

\[ \bar{H}_{p\|m}(X) = -\bar{H}_p(X) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln m_{X_k|X^k}(x_k|x^k), \]

where the sum is well defined because if $m_{X_k|X^k}(x_k|x^k) = 0$, then so must
$p_{X^{k+1}}(x^{k+1}) = 0$ from absolute continuity. □
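The formula of Lemma 2.4.3 can be compared with a brute-force computation for $k = 1$. In the sketch below, which is illustrative only, $p$ is taken to be an i.i.d. binary source (a simple stationary process) and $m$ a one-step Markov source with a hypothetical initial pmf `m0` and transition matrix `T`; all names and numbers are assumptions.

```python
import itertools
import math

p1 = [0.7, 0.3]                 # marginal pmf of the i.i.d. source p
m0 = [0.5, 0.5]                 # initial pmf of the Markov reference m
T = [[0.9, 0.1], [0.4, 0.6]]    # stationary transition probabilities of m

def relative_entropy_block(n):
    """H_{p||m}(X^n) in nats, by enumerating all binary n-tuples."""
    total = 0.0
    for x in itertools.product(range(2), repeat=n):
        px = math.prod(p1[a] for a in x)
        mx = m0[x[0]] * math.prod(T[a][b] for a, b in zip(x, x[1:]))
        total += px * math.log(px / mx)
    return total

entropy_rate_p = -sum(q * math.log(q) for q in p1)
# E_p[ln m(X_1 | X_0)] with (X_0, X_1) independent under p.
cross_term = sum(p1[a] * p1[b] * math.log(T[a][b]) for a in range(2) for b in range(2))
limit = -entropy_rate_p - cross_term

finite = [relative_entropy_block(n) / n for n in range(1, 13)]
```

For this example the difference between `finite[n-1]` and `limit` is exactly a constant divided by $n$, which mirrors the $(n-k)/n$ weighting in the proof.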
Combining the previous lemma with the ergodic decomposition of entropy
rate yields the following corollary.

Corollary 2.4.2: The Ergodic Decomposition of Relative Entropy Rate.
Let $(A^{Z_+}, \mathcal{B}(A)^{Z_+}, p, T)$ be a stationary dynamical system corresponding to a
stationary finite alphabet source $\{X_n\}$. Let $m$ be a $k$th order Markov process
for which $m_{X^n} \gg p_{X^n}$ for all $n$. Let $\{p_x\}$ denote the ergodic decomposition
of $p$. If $\bar{H}_{p_x\|m}(X)$ is $p$-integrable, then

\[ \bar{H}_{p\|m}(X) = \int dp(x) \bar{H}_{p_x\|m}(X). \]

2.5 Conditional Entropy and Information 

We now turn to other notions of information. While we could do without these 
if we confined interest to finite alphabet processes, they will be essential for 
later generalizations and provide additional intuition and results even in the 
finite alphabet case. We begin by adding a second finite alphabet measurement 
to the setup of the previous sections. To conform more to information theory
tradition, we consider the measurements as finite alphabet random variables $X$
and $Y$ rather than $f$ and $g$. This has the advantage of releasing $f$ and $g$ for use
as functions defined on the random variables: $f(X)$ and $g(Y)$. Let $(\Omega, \mathcal{B}, P, T)$
be a dynamical system. Let $X$ and $Y$ be finite alphabet measurements defined
on $\Omega$ with alphabets $A_X$ and $A_Y$. Define the conditional entropy of $X$ given $Y$
by

\[ H(X|Y) = H(X,Y) - H(Y). \]

The name conditional entropy comes from the fact that

\begin{align*}
H(X|Y) &= -\sum_{x,y} P(X = x, Y = y) \ln P(X = x|Y = y) \\
&= -\sum_{x,y} p_{X,Y}(x,y) \ln p_{X|Y}(x|y),
\end{align*}

where $p_{X,Y}(x,y)$ is the joint pmf for $(X,Y)$ and $p_{X|Y}(x|y) = p_{X,Y}(x,y)/p_Y(y)$
is the conditional pmf. Defining

\[ H(X|Y = y) = -\sum_x p_{X|Y}(x|y) \ln p_{X|Y}(x|y) \]

we can also write

\[ H(X|Y) = \sum_y p_Y(y) H(X|Y = y). \]

Thus conditional entropy is an average of entropies with respect to conditional
pmf's. We have immediately from Lemma 2.3.2 and the definition of conditional
entropy that

\[ 0 \le H(X|Y) \le H(X). \tag{2.22} \]
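The two descriptions of conditional entropy, as a difference of entropies and as an average of entropies of conditional pmf's, can be compared numerically. The sketch below uses an arbitrary example joint pmf; the variable names are assumptions made for illustration.

```python
import math

def entropy(p):
    """Entropy in nats."""
    return -sum(v * math.log(v) for v in p if v > 0)

# Example joint pmf p_{X,Y}(x, y): rows index x, columns index y.
joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.10, 0.20]]

px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Definition: H(X|Y) = H(X,Y) - H(Y).
h_cond_def = entropy(v for row in joint for v in row) - entropy(py)

# Average over y of the entropy of the conditional pmf p_{X|Y}(.|y).
h_cond_avg = sum(
    py[y] * entropy(joint[x][y] / py[y] for x in range(len(px)))
    for y in range(len(py))
)

h_x = entropy(px)   # upper bound in (2.22)
```

Both computations should agree, and the common value should lie between 0 and $H(X)$ as in (2.22).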
The inequalities could also be written in terms of the partitions induced by $X$
and $Y$. Recall that according to Lemma 2.3.2 the right hand inequality will be
an equality if and only if $X$ and $Y$ are independent.

Define the average mutual information between $X$ and $Y$ by

\begin{align*}
I(X;Y) &= H(X) + H(Y) - H(X,Y) \\
&= H(X) - H(X|Y) = H(Y) - H(Y|X).
\end{align*}

In terms of distributions and pmf's we have that

\begin{align*}
I(X;Y) &= \sum_{x,y} P(X = x, Y = y) \ln \frac{P(X = x, Y = y)}{P(X = x) P(Y = y)} \\
&= \sum_{x,y} p_{X,Y}(x,y) \ln \frac{p_{X,Y}(x,y)}{p_X(x) p_Y(y)} \\
&= \sum_{x,y} p_{X,Y}(x,y) \ln \frac{p_{X|Y}(x|y)}{p_X(x)} \\
&= \sum_{x,y} p_{X,Y}(x,y) \ln \frac{p_{Y|X}(y|x)}{p_Y(y)}.
\end{align*}

Note also that mutual information can be expressed as a divergence by

\[ I(X;Y) = D(P_{XY}\|P_X \times P_Y), \]

where $P_X \times P_Y$ is the product measure on $(X, Y)$, that is, a probability measure
which gives $X$ and $Y$ the same marginal distributions as $P_{XY}$, but under which
$X$ and $Y$ are independent. Entropy is a special case of mutual information since

\[ H(X) = I(X;X). \]
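The equivalent forms of average mutual information can be cross-checked on a small example. The sketch below, with an arbitrary joint pmf and hypothetical helper names, compares the entropy form with the divergence form and verifies that $I(X;X) = H(X)$.

```python
import math

def entropy(p):
    """Entropy in nats."""
    return -sum(v * math.log(v) for v in p if v > 0)

def mutual_information(joint):
    """I(X;Y) = D(P_XY || P_X x P_Y) in nats for a joint pmf given as nested lists."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(v * math.log(v / (px[x] * py[y]))
               for x, row in enumerate(joint)
               for y, v in enumerate(row) if v > 0)

joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.10, 0.20]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

mi_divergence = mutual_information(joint)
mi_entropies = entropy(px) + entropy(py) - entropy(v for row in joint for v in row)

# (X, X) has a diagonal joint pmf, and its mutual information is H(X).
diag = [[px[i] if i == j else 0.0 for j in range(len(px))] for i in range(len(px))]
self_information = mutual_information(diag)
```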



We can collect several of the properties of entropy and relative entropy and 
produce corresponding properties of mutual information. We state these in the 
form using measurements, but they can equally well be expressed in terms of 
partitions. 

Lemma 2.5.1: Suppose that $X$ and $Y$ are two finite alphabet random
variables defined on a common probability space. Then

\[ 0 \le I(X;Y) \le \min(H(X), H(Y)). \]

Suppose that $f : A_X \to A$ and $g : A_Y \to B$ are two measurements. Then

\[ I(f(X); g(Y)) \le I(X;Y). \]

Proof: The first result follows immediately from the properties of entropy.
The second follows from Lemma 2.3.3 applied to the measurement $(f, g)$ since
mutual information is a special case of relative entropy. □

The next lemma collects some additional, similar properties. 
Lemma 2.5.2: Given the assumptions of the previous lemma,

\begin{align*}
H(f(X)|X) &= 0, \\
H(X, f(X)) &= H(X), \\
H(X) &= H(f(X)) + H(X|f(X)), \\
I(X; f(X)) &= H(f(X)), \\
H(X|g(Y)) &\ge H(X|Y), \\
I(f(X); g(Y)) &\le I(X;Y), \\
H(X|Y) &= H(X, f(X,Y)|Y),
\end{align*}

and, if $Z$ is a third finite alphabet random variable defined on the same
probability space,

\[ H(X|Y) \ge H(X|Y,Z). \]

Comments: The first relation has the interpretation that given a random 
variable, there is no additional information in a measurement made on the 
random variable. The second and third relationships follow from the first and 
the definitions. The third relation is a form of chain rule and it implies that given 
a measurement on a random variable, the entropy of the random variable is given 
by that of the measurement plus the conditional entropy of the random variable 
given the measurement. This provides an alternative proof of the second result 
of Lemma 2.3.3. The fifth relation says that conditioning on a measurement of 
a random variable is less informative than conditioning on the random variable 
itself. The sixth relation states that coding reduces mutual information as well 
as entropy. The seventh relation is a conditional extension of the second. The 
eighth relation says that conditional entropy is nonincreasing when conditioning 
on more information. 

Proof: Since $f(X)$ is a deterministic function of $X$, the conditional pmf is
trivial (a Kronecker delta) and hence $H(f(X)|X = x)$ is 0 for all $x$, hence the
first relation holds. The second and third relations follow from the first and the
definition of conditional entropy. The fourth relation follows from the first since
$I(X;Y) = H(Y) - H(Y|X)$. The fifth relation follows from the previous lemma
since

\[ H(X) - H(X|g(Y)) = I(X; g(Y)) \le I(X;Y) = H(X) - H(X|Y). \]

The sixth relation follows from Corollary 2.3.2 and the fact that

\[ I(X;Y) = D(P_{X,Y}\|P_X \times P_Y). \]

The seventh relation follows since

\begin{align*}
H(X, f(X,Y)|Y) &= H(X, f(X,Y), Y) - H(Y) \\
&= H(X,Y) - H(Y) = H(X|Y).
\end{align*}

The final relation follows from the fifth by replacing $Y$ by $(Y, Z)$ and setting
$g(Y,Z) = Y$. □

In a similar fashion we can consider conditional relative entropies. Suppose
now that $M$ and $P$ are two probability measures on a common space, that $X$
and $Y$ are two random variables defined on that space, and that $M_{XY} \gg P_{XY}$
(and hence also $M_X \gg P_X$ and $M_Y \gg P_Y$). Analogous to the definition of the conditional
entropy we can define

\[ H_{P\|M}(X|Y) = H_{P\|M}(X,Y) - H_{P\|M}(Y). \]

Some algebra shows that this is equivalent to

\begin{align*}
H_{P\|M}(X|Y) &= \sum_{x,y} p_{X,Y}(x,y) \ln \frac{p_{X|Y}(x|y)}{m_{X|Y}(x|y)} \\
&= \sum_y p_Y(y) \left( \sum_x p_{X|Y}(x|y) \ln \frac{p_{X|Y}(x|y)}{m_{X|Y}(x|y)} \right). \tag{2.23}
\end{align*}

This can be written as

\[ H_{P\|M}(X|Y) = \sum_y p_Y(y) D(p_{X|Y}(\cdot|y)\|m_{X|Y}(\cdot|y)), \]

an average of divergences of conditional pmf's, each of which is well defined
because of the original absolute continuity of the joint measure. Manipulations
similar to those for entropy can now be used to prove the following properties
of conditional relative entropies.

Lemma 2.5.3: Given two probability measures $M$ and $P$ on a common space,
and two random variables $X$ and $Y$ defined on that space with the property that
$M_{XY} \gg P_{XY}$, then the following properties hold:

\begin{align*}
H_{P\|M}(f(X)|X) &= 0, \\
H_{P\|M}(X, f(X)) &= H_{P\|M}(X), \\
H_{P\|M}(X) &= H_{P\|M}(f(X)) + H_{P\|M}(X|f(X)). \tag{2.24}
\end{align*}

If $M_{XY} = M_X \times M_Y$ (that is, if the pmfs satisfy $m_{X,Y}(x,y) = m_X(x) m_Y(y)$),
then

\[ H_{P\|M}(X,Y) \ge H_{P\|M}(X) + H_{P\|M}(Y) \]

and

\[ H_{P\|M}(X|Y) \ge H_{P\|M}(X). \]

Eq. (2.24) is a chain rule for relative entropy which provides as a corollary an
immediate proof of Lemma 2.3.3. The final two inequalities resemble inequalities
for entropy (with a sign reversal), but they do not hold for all reference measures.
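The role of the product reference measure in the last two inequalities can be made concrete. The sketch below is an illustration with hypothetical helper names: it checks the superadditivity inequality for random joint pmfs against random product references, and then exhibits one non-product reference (chosen by hand) for which the inequality fails.

```python
import math
import random

def divergence(p, m):
    """D(P||M) in nats; assumes absolute continuity (m > 0 wherever p > 0)."""
    return sum(a * math.log(a / b) for a, b in zip(p, m) if a > 0)

def marginals(joint):
    return [sum(r) for r in joint], [sum(c) for c in zip(*joint)]

def flat(joint):
    return [v for r in joint for v in r]

def superadditivity_gap(p_joint, m_joint):
    """H_{P||M}(X,Y) - H_{P||M}(X) - H_{P||M}(Y)."""
    px, py = marginals(p_joint)
    mx, my = marginals(m_joint)
    return divergence(flat(p_joint), flat(m_joint)) - divergence(px, mx) - divergence(py, my)

def random_pmf(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(4)
product_ok = True
for _ in range(1000):
    cells = random_pmf(9, rng)
    p_joint = [cells[0:3], cells[3:6], cells[6:9]]
    mx, my = random_pmf(3, rng), random_pmf(3, rng)
    m_joint = [[a * b for b in my] for a in mx]          # product reference measure
    product_ok &= superadditivity_gap(p_joint, m_joint) >= -1e-12

# A non-product reference: X = Y under both measures, with different weights.
p_dep = [[0.5, 0.0], [0.0, 0.5]]
m_dep = [[0.9, 0.0], [0.0, 0.1]]
counterexample_gap = superadditivity_gap(p_dep, m_dep)
```

In the hand-built example the joint divergence equals each marginal divergence, so the gap is negative and the inequality is violated.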

The above lemmas along with Lemma 2.3.3 show that all of the information
measures thus far considered are reduced by taking measurements or by
coding. This property is the key to generalizing these quantities to nondiscrete
alphabets.

We saw in Lemma 2.3.4 that entropy was a convex $\cap$ function of the
underlying distribution. The following lemma provides similar properties of mutual
information considered as a function of either a marginal or a conditional
distribution.

Lemma 2.5.4: Let $\mu$ denote a pmf on a discrete space $A_X$, $\mu(x) = \Pr(X = x)$,
and let $q$ be a conditional pmf, $q(y|x) = \Pr(Y = y|X = x)$. Let $\mu q$ denote the
resulting joint pmf $\mu q(x,y) = \mu(x) q(y|x)$. Let $I_{\mu q} = I_{\mu q}(X;Y)$ be the average
mutual information. Then $I_{\mu q}$ is a convex $\cup$ function of $q$; that is, given two
conditional pmf's $q_1$ and $q_2$, a $\lambda \in [0, 1]$, and $\bar{q} = \lambda q_1 + (1 - \lambda) q_2$, then

\[ I_{\mu \bar{q}} \le \lambda I_{\mu q_1} + (1 - \lambda) I_{\mu q_2}, \]

and $I_{\mu q}$ is a convex $\cap$ function of $\mu$, that is, given two pmf's $\mu_1$ and $\mu_2$, $\lambda \in [0, 1]$,
and $\bar{\mu} = \lambda \mu_1 + (1 - \lambda) \mu_2$,

\[ I_{\bar{\mu} q} \ge \lambda I_{\mu_1 q} + (1 - \lambda) I_{\mu_2 q}. \]
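Both convexity statements can be probed numerically before reading the proof. In the following sketch the input pmfs and conditional pmfs are drawn at random, and the helper names (`mutual_information`, `random_channel`, and so on) are assumptions introduced only for this illustration.

```python
import math
import random

def mutual_information(mu, q):
    """I(X;Y) in nats for input pmf mu and conditional pmf q[x][y] = Pr(Y=y | X=x)."""
    ny = len(q[0])
    r = [sum(mu[x] * q[x][y] for x in range(len(mu))) for y in range(ny)]  # output pmf
    return sum(mu[x] * q[x][y] * math.log(q[x][y] / r[y])
               for x in range(len(mu)) for y in range(ny)
               if mu[x] * q[x][y] > 0)

def random_pmf(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

def random_channel(nx, ny, rng):
    return [random_pmf(ny, rng) for _ in range(nx)]

def mix(u, v, lam):
    return [lam * a + (1 - lam) * b for a, b in zip(u, v)]

rng = random.Random(6)
convex_in_q = True     # I is convex "cup" in the conditional pmf q
concave_in_mu = True   # I is convex "cap" in the input pmf mu
for _ in range(1000):
    lam = rng.random()
    mu, mu2 = random_pmf(3, rng), random_pmf(3, rng)
    q1, q2 = random_channel(3, 4, rng), random_channel(3, 4, rng)
    q_bar = [mix(r1, r2, lam) for r1, r2 in zip(q1, q2)]
    convex_in_q &= (mutual_information(mu, q_bar)
                    <= lam * mutual_information(mu, q1)
                    + (1 - lam) * mutual_information(mu, q2) + 1e-12)
    mu_bar = mix(mu, mu2, lam)
    concave_in_mu &= (mutual_information(mu_bar, q1)
                      >= lam * mutual_information(mu, q1)
                      + (1 - lam) * mutual_information(mu2, q1) - 1e-12)
```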



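Proof: Let $r$ (respectively, $r_1$, $r_2$, $\bar{r}$) denote the pmf for $Y$ resulting from $q$
(respectively $q_1$, $q_2$, $\bar{q}$), that is, $r(y) = \Pr(Y = y) = \sum_x \mu(x) q(y|x)$. From (2.5)

\begin{align*}
I_{\mu \bar{q}} &= \lambda \sum_{x,y} \mu(x) q_1(y|x) \ln \left( \frac{\mu(x) \bar{q}(y|x)}{\mu(x) \bar{r}(y)} \frac{\mu(x) r_1(y)}{\mu(x) q_1(y|x)} \frac{\mu(x) q_1(y|x)}{\mu(x) r_1(y)} \right) \\
&\quad + (1 - \lambda) \sum_{x,y} \mu(x) q_2(y|x) \ln \left( \frac{\mu(x) \bar{q}(y|x)}{\mu(x) \bar{r}(y)} \frac{\mu(x) r_2(y)}{\mu(x) q_2(y|x)} \frac{\mu(x) q_2(y|x)}{\mu(x) r_2(y)} \right) \\
&\le \lambda I_{\mu q_1} + \lambda \sum_{x,y} \mu(x) q_1(y|x) \left( \frac{\mu(x) \bar{q}(y|x)}{\mu(x) \bar{r}(y)} \frac{\mu(x) r_1(y)}{\mu(x) q_1(y|x)} - 1 \right) \\
&\quad + (1 - \lambda) I_{\mu q_2} + (1 - \lambda) \sum_{x,y} \mu(x) q_2(y|x) \left( \frac{\mu(x) \bar{q}(y|x)}{\mu(x) \bar{r}(y)} \frac{\mu(x) r_2(y)}{\mu(x) q_2(y|x)} - 1 \right) \\
&= \lambda I_{\mu q_1} + (1 - \lambda) I_{\mu q_2} + \lambda \left( -1 + \sum_{x,y} \mu(x) \bar{q}(y|x) \frac{r_1(y)}{\bar{r}(y)} \right) \\
&\quad + (1 - \lambda) \left( -1 + \sum_{x,y} \mu(x) \bar{q}(y|x) \frac{r_2(y)}{\bar{r}(y)} \right) = \lambda I_{\mu q_1} + (1 - \lambda) I_{\mu q_2}.
\end{align*}

Similarly, let $\bar{\mu} = \lambda \mu_1 + (1 - \lambda) \mu_2$ and let $r_1$, $r_2$, and $\bar{r}$ denote the induced
output pmf's. Then

\begin{align*}
I_{\bar{\mu} q} &= \lambda \sum_{x,y} \mu_1(x) q(y|x) \ln \left( \frac{q(y|x)}{\bar{r}(y)} \frac{r_1(y)}{q(y|x)} \frac{q(y|x)}{r_1(y)} \right) \\
&\quad + (1 - \lambda) \sum_{x,y} \mu_2(x) q(y|x) \ln \left( \frac{q(y|x)}{\bar{r}(y)} \frac{r_2(y)}{q(y|x)} \frac{q(y|x)}{r_2(y)} \right) \\
&= \lambda I_{\mu_1 q} + (1 - \lambda) I_{\mu_2 q} - \lambda \sum_{x,y} \mu_1(x) q(y|x) \ln \frac{\bar{r}(y)}{r_1(y)} \\
&\quad - (1 - \lambda) \sum_{x,y} \mu_2(x) q(y|x) \ln \frac{\bar{r}(y)}{r_2(y)} \ge \lambda I_{\mu_1 q} + (1 - \lambda) I_{\mu_2 q}
\end{align*}

from another application of (2.5). □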

We consider one other notion of information: Given three finite alphabet
random variables $X, Y, Z$, define the conditional mutual information between $X$
and $Y$ given $Z$ by

\[ I(X;Y|Z) = D(P_{XYZ}\|P_{X \times Y|Z}) \tag{2.25} \]

where $P_{X \times Y|Z}$ is the distribution defined by its values on rectangles as

\[ P_{X \times Y|Z}(F \times G \times D) = \sum_{z \in D} P(X \in F|Z = z) P(Y \in G|Z = z) P(Z = z). \tag{2.26} \]

$P_{X \times Y|Z}$ has the same conditional distributions for $X$ given $Z$ and for $Y$ given
$Z$ as does $P_{XYZ}$, but now $X$ and $Y$ are conditionally independent given $Z$.
Alternatively, the conditional distribution for $X, Y$ given $Z$ under the distribution
$P_{X \times Y|Z}$ is the product distribution $P_{X|Z} \times P_{Y|Z}$. Thus

\begin{align*}
I(X;Y|Z) &= \sum_{x,y,z} p_{XYZ}(x,y,z) \ln \frac{p_{XYZ}(x,y,z)}{p_{X|Z}(x|z) p_{Y|Z}(y|z) p_Z(z)} \\
&= \sum_{x,y,z} p_{XYZ}(x,y,z) \ln \frac{p_{XY|Z}(x,y|z)}{p_{X|Z}(x|z) p_{Y|Z}(y|z)}. \tag{2.27}
\end{align*}

Since

\[ \frac{p_{XYZ}}{p_{X|Z} p_{Y|Z} p_Z} = \frac{p_{XYZ}}{p_X p_{YZ}} \times \frac{p_X}{p_{X|Z}} = \frac{p_{XYZ}}{p_{XZ} p_Y} \times \frac{p_Y}{p_{Y|Z}} \]

we have the first statement in the following lemma.

Lemma 2.5.4:

\[ I(X;Y|Z) + I(Y;Z) = I(Y;(X,Z)), \tag{2.28} \]

\[ I(X;Y|Z) \ge 0, \tag{2.29} \]

with equality if and only if $X$ and $Y$ are conditionally independent given $Z$,
that is, $p_{XY|Z} = p_{X|Z} p_{Y|Z}$. Given finite valued measurements $f$ and $g$,

\[ I(f(X); g(Y)|Z) \le I(X;Y|Z). \]

Proof: The second inequality follows from the divergence inequality (2.6)
with $P = P_{XYZ}$ and $M = P_{X \times Y|Z}$, i.e., the pmf's $p_{XYZ}$ and $p_{X|Z} p_{Y|Z} p_Z$. The
third inequality follows from Lemma 2.3.3 or its corollary applied to the same
measures. □
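Kolmogorov's formula (2.28) and the nonnegativity (2.29) are straightforward to verify on a randomly generated joint pmf. The sketch below stores the pmf of $(X, Y, Z)$ in a dictionary keyed by `(x, y, z)`; this representation, the alphabet sizes, and the helper `marginal` are assumptions made for illustration.

```python
import itertools
import math
import random

rng = random.Random(5)
shape = (2, 3, 2)                                  # alphabet sizes of X, Y, Z
keys = list(itertools.product(*(range(s) for s in shape)))
w = [rng.random() for _ in keys]
total_w = sum(w)
p = {k: v / total_w for k, v in zip(keys, w)}      # joint pmf p_XYZ

def marginal(pmf, axes):
    """Marginal pmf over the given coordinate positions, keyed by tuples."""
    out = {}
    for k, v in pmf.items():
        kk = tuple(k[a] for a in axes)
        out[kk] = out.get(kk, 0.0) + v
    return out

pz = marginal(p, (2,))
py = marginal(p, (1,))
pxz = marginal(p, (0, 2))
pyz = marginal(p, (1, 2))

# (2.27): p_XYZ / (p_X|Z p_Y|Z p_Z) = p_XYZ p_Z / (p_XZ p_YZ).
cond_mi = sum(v * math.log(v * pz[(z,)] / (pxz[(x, z)] * pyz[(y, z)]))
              for (x, y, z), v in p.items())
mi_y_z = sum(v * math.log(v / (py[(y,)] * pz[(z,)]))
             for (y, z), v in pyz.items())
mi_y_xz = sum(v * math.log(v / (py[(y,)] * pxz[(x, z)]))
              for (x, y, z), v in p.items())
```

With these quantities, `cond_mi + mi_y_z` should match `mi_y_xz` up to rounding, and `cond_mi` should be nonnegative.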

Comments: Eq. (2.28) is called Kolmogorov's formula. If $X$ and $Y$ are
conditionally independent given $Z$ in the above sense, then we also have that
$p_{X|YZ} = p_{XY|Z}/p_{Y|Z} = p_{X|Z}$, in which case we say that $Y \to Z \to X$ is a
Markov chain and note that given $Z$, $X$ does not depend on $Y$. (Note that if
$Y \to Z \to X$ is a Markov chain, then so is $X \to Z \to Y$.) Thus the conditional
mutual information is 0 if and only if the variables form a Markov chain with
the conditioning variable in the middle. One might be tempted to infer from
Lemma 2.3.3 that given finite valued measurements $f$, $g$, and $r$

\[ I(f(X); g(Y)|r(Z)) \le I(X;Y|Z). \]

This does not follow, however, since it is not true that if $\mathcal{Q}$ is the partition
corresponding to the three quantizers, then $D(P_{f(X),g(Y),r(Z)}\|P_{f(X) \times g(Y)|r(Z)})$
is $H_{P_{X,Y,Z}\|P_{X \times Y|Z}}(f(X), g(Y), r(Z))$ because of the way that $P_{X \times Y|Z}$ is
constructed; e.g., the fact that $X$ and $Y$ are conditionally independent given $Z$
implies that $f(X)$ and $g(Y)$ are conditionally independent given $Z$, but it does
not imply that $f(X)$ and $g(Y)$ are conditionally independent given $r(Z)$.
Alternatively, if $M$ is $P_{X \times Y|Z}$, then it is not true that $P_{f(X) \times g(Y)|r(Z)}$ equals
$M(fgr)^{-1}$. Note that if this inequality were true, choosing $r(z)$ to be trivial
(say 1 for all $z$) would result in $I(X;Y|Z) \ge I(X;Y|r(Z)) = I(X;Y)$. This
cannot be true in general since, for example, choosing $Z$ as $(X, Y)$ would give
$I(X;Y|Z) = 0$. Thus one must be careful when applying Lemma 2.3.3 if the
measures and random variables are related as they are in the case of conditional
mutual information.

We close this section with an easy corollary of the previous lemma and of the 
definition of conditional entropy. Results of this type are referred to as chain 
rules for information and entropy. 

Corollary 2.5.1: Given finite alphabet random variables $Y, X_1, X_2, \cdots, X_n$,

\begin{align*}
H(X_1, X_2, \cdots, X_n) &= \sum_{i=1}^{n} H(X_i|X_1, \cdots, X_{i-1}), \\
H_{P\|M}(X_1, X_2, \cdots, X_n) &= \sum_{i=1}^{n} H_{P\|M}(X_i|X_1, \cdots, X_{i-1}), \\
I(Y; (X_1, X_2, \cdots, X_n)) &= \sum_{i=1}^{n} I(Y; X_i|X_1, \cdots, X_{i-1}).
\end{align*}

2.6 Entropy Rate Revisited 

The chain rule of Corollary 2.5.1 provides a means of computing entropy rates
for stationary processes. We have that $H(X^n) = \sum_{i=0}^{n-1} H(X_i|X^i)$, where
$X^i = (X_0, \cdots, X_{i-1})$ and the $i = 0$ term is $H(X_0)$.
First suppose that the source is a stationary $k$th order Markov process, that
is, for any $n \ge k$

\begin{align*}
&\Pr(X_n = x_n | X_i = x_i;\ i = 0, 1, \cdots, n-1) \\
&\quad = \Pr(X_n = x_n | X_i = x_i;\ i = n-k, \cdots, n-1).
\end{align*}

For such a process we have for all $n \ge k$ that

\[ H(X_n|X^n) = H(X_n|X_{n-k}^k) = H(X_k|X^k), \]

where $X_i^m = (X_i, \cdots, X_{i+m-1})$. Thus taking the limit as $n \to \infty$ of the $n$th order
entropy, all but a finite number of terms in the sum are identical and hence the
Cesaro (or arithmetic) mean is given by this common conditional entropy. We have
therefore proved the following lemma.

Lemma 2.6.1: If $\{X_n\}$ is a stationary $k$th order Markov source, then

\[ \bar{H}(X) = H(X_k|X^k). \]

If we have a two-sided stationary process $\{X_n\}$, then all of the previous
definitions for entropies of vectors extend in an obvious fashion and a generalization
of the Markov result follows if we use stationarity and the chain rule to write

\[ \frac{1}{n} H(X^n) = \frac{1}{n} \sum_{i=0}^{n-1} H(X_0|X_{-1}, \cdots, X_{-i}). \]

Since conditional entropy is nonincreasing with more conditioning variables
((2.22) or Lemma 2.5.2), $H(X_0|X_{-1}, \cdots, X_{-i})$ has a limit. Again using the fact
that a Cesaro mean of terms all converging to a common limit also converges
to the same limit we have the following result.

Lemma 2.6.2: If $\{X_n\}$ is a two-sided stationary source, then

\[ \bar{H}(X) = \lim_{n \to \infty} H(X_0|X_{-1}, \cdots, X_{-n}). \]

It is tempting to identify the above limit as the conditional entropy given
the infinite past, $H(X_0 | X_{-1}, \ldots)$. Since the conditioning variable is a sequence
and does not have a finite alphabet, such a conditional entropy is not included 
in any of the definitions yet introduced. We shall later demonstrate that this 
interpretation is indeed valid when the notion of conditional entropy has been 
suitably generalized. 

The natural generalization of Lemma 2.6.2 to relative entropy rates unfortunately
does not work because conditional relative entropies are not in general
monotonic with increased conditioning and hence the chain rule does not immediately
yield a limiting argument analogous to that for entropy. The argument
does work if the reference measure is $k$th order Markov, as considered in the
following lemma.
Lemma 2.6.3: If $\{X_n\}$ is a source described by process distributions $p$ and
$m$ and if $p$ is stationary and $m$ is $k$th order Markov with stationary transitions,
then for $n > k$, $H_{p\|m}(X_0 | X_{-1}, \ldots, X_{-n})$ is nondecreasing in $n$ and

$$\bar{H}_{p\|m}(X) = \lim_{n \to \infty} H_{p\|m}(X_0 | X_{-1}, \ldots, X_{-n}) = -\bar{H}_p(X) - E_p[\ln m(X_k | X^k)].$$

Proof: For $n > k$ we have that

$$H_{p\|m}(X_0 | X_{-1}, \ldots, X_{-n}) = -H_p(X_0 | X_{-1}, \ldots, X_{-n}) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln m_{X_k | X^k}(x_k | x^k).$$

Since the conditional entropy is nonincreasing with n and the remaining term 
does not depend on n, the combination is nondecreasing with n. The remainder 
of the proof then parallels the entropy rate result. □ 
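
The identity that opens the proof can be unpacked in three steps. The sketch below assumes the usual definition of conditional relative entropy as the $p$-expectation of the log ratio of conditional pmf's, and abbreviates pmf's by $p(\cdot)$ and $m(\cdot)$; this shorthand is introduced here only for the illustration.

```latex
\begin{align*}
H_{p\|m}(X_0 | X_{-1}, \ldots, X_{-n})
  &= \sum_{x_{-n}, \ldots, x_0} p(x_{-n}, \ldots, x_0)
     \ln \frac{p(x_0 | x_{-1}, \ldots, x_{-n})}{m(x_0 | x_{-1}, \ldots, x_{-n})} \\
  &= -H_p(X_0 | X_{-1}, \ldots, X_{-n})
     - \sum_{x_{-n}, \ldots, x_0} p(x_{-n}, \ldots, x_0)
       \ln m(x_0 | x_{-1}, \ldots, x_{-k}) \\
  &= -H_p(X_0 | X_{-1}, \ldots, X_{-n}) - E_p[\ln m(X_k | X^k)].
\end{align*}
```

The second line uses the $k$th order Markov property of $m$ (valid once $n \ge k$), and the third uses the stationarity of $p$ together with the stationary transitions of $m$ to shift the time index, which is why the last term does not depend on $n$.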

It is important to note that the relative entropy analogs to entropy properties
often require $k$th order Markov assumptions on the reference measure (but not
on the original measure).

Markov Approximations 

Recall that the relative entropy rate $\bar{H}_{p\|m}(X)$ can be thought of as a distance
between the process with distribution $p$ and that with distribution $m$ and that
the rate is given by a limit if the reference measure $m$ is Markov. A particular
Markov measure relevant to $p$ is the distribution $p^{(k)}$ which is the $k$th order
Markov approximation to $p$ in the sense that it is a $k$th order Markov source
and it has the same $k$th order transition probabilities as $p$. To be more precise,
the process distribution $p^{(k)}$ is specified by its finite dimensional distributions

$$p^{(k)}_{X^k}(x^k) = p_{X^k}(x^k)$$

$$p^{(k)}_{X^n}(x^n) = p_{X^k}(x^k) \prod_{l=k}^{n-1} p_{X_l | X^k_{l-k}}(x_l | x^k_{l-k}); \quad n = k, k+1, \ldots$$

so that

$$p^{(k)}_{X_k | X^k} = p_{X_k | X^k}.$$

It is natural to ask how good this approximation is, especially in the limit, that
is, to study the behavior of the relative entropy rate $\bar{H}_{p\|p^{(k)}}(X)$ as $k \to \infty$.

Theorem 2.6.2: Given a stationary process $p$, let $p^{(k)}$ denote the $k$th order
Markov approximations to $p$. Then

$$\lim_{k \to \infty} \bar{H}_{p\|p^{(k)}}(X) = \inf_k \bar{H}_{p\|p^{(k)}}(X) = 0.$$

Thus the Markov approximations are asymptotically accurate in the sense that
the relative entropy rate between the source and approximation can be made
arbitrarily small (zero if the original source itself happens to be Markov).
Proof: As in the proof of Lemma 2.6.3 we can write for $n > k$ that

$$H_{p\|p^{(k)}}(X_0 | X_{-1}, \ldots, X_{-n}) = -H_p(X_0 | X_{-1}, \ldots, X_{-n}) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln p_{X_k | X^k}(x_k | x^k)$$

$$= H_p(X_0 | X_{-1}, \ldots, X_{-k}) - H_p(X_0 | X_{-1}, \ldots, X_{-n}).$$

Note that this implies that $p^{(k)}_{X^n} \gg p_{X^n}$ for all $n$ since the entropies are finite.
This automatic domination of the finite dimensional distributions of a measure
by those of its Markov approximation will not hold in the general case to be
encountered later; it is specific to the finite alphabet case. Taking the limit as
$n \to \infty$ gives

$$\bar{H}_{p\|p^{(k)}}(X) = \lim_{n \to \infty} H_{p\|p^{(k)}}(X_0 | X_{-1}, \ldots, X_{-n}) = H_p(X_0 | X_{-1}, \ldots, X_{-k}) - \bar{H}_p(X).$$

The theorem then follows immediately from Lemma 2.6.2. □
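
The quantities in this proof can be checked by brute force on a small example. The sketch below assumes a hypothetical stationary binary source obtained by observing a two-state Markov chain through a noisy channel (a source that is not itself Markov of any finite order); the matrices `A`, `PI`, `E` and all function names are illustrative assumptions. It builds the $k$th order Markov approximation from the finite dimensional distributions and compares the divergence between $n$-tuple distributions with the entropy expression obtained by summing the conditional relative entropies above.

```python
import math
from itertools import product

A = [[0.9, 0.1], [0.2, 0.8]]      # hidden-state transition matrix (assumed)
PI = [2 / 3, 1 / 3]               # its stationary distribution, PI = PI A
E = [[0.85, 0.15], [0.25, 0.75]]  # E[s][x] = Pr(output x | hidden state s)

def prob(xs):
    """p(x^n) by the forward recursion; the empty tuple has probability 1."""
    if not xs:
        return 1.0
    alpha = [PI[s] * E[s][xs[0]] for s in (0, 1)]
    for x in xs[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in (0, 1)) * E[s][x] for s in (0, 1)]
    return sum(alpha)

def H(n):
    """Block entropy H(X^n) in nats."""
    return -sum(prob(xs) * math.log(prob(xs)) for xs in product((0, 1), repeat=n))

def cond_H(j):
    """H(X_0 | X_{-1}, ..., X_{-j}) = H(X^{j+1}) - H(X^j), using stationarity."""
    return H(j + 1) - H(j)

def markov_approx(xs, k):
    """p^{(k)}(x^n): exact k-tuple pmf times k-th order transition probabilities."""
    p = prob(xs[:k])
    for l in range(k, len(xs)):
        p *= prob(xs[l - k:l + 1]) / prob(xs[l - k:l])
    return p

def divergence(n, k):
    """D(p_{X^n} || p^{(k)}_{X^n}) by direct summation over all n-tuples."""
    return sum(prob(xs) * math.log(prob(xs) / markov_approx(xs, k))
               for xs in product((0, 1), repeat=n))

n = 8
for k in (1, 2, 3):
    direct = divergence(n, k)
    via_entropies = (n - k) * cond_H(k) - (H(n) - H(k))
    print(k, direct, via_entropies)
```

For this source the divergence shrinks as $k$ grows, in line with the theorem, and the conditional entropies `cond_H(j)` are nonincreasing in `j` as used in Lemma 2.6.2.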

Markov approximations will play a fundamental role when considering rela- 
tive entropies for general (nonfinite alphabet) processes. The basic result above 
will generalize to that case, but the proof will be much more involved. 



2.7 Relative Entropy Densities 

Many of the convergence results to come will be given and stated in terms 
of relative entropy densities. In this section we present a simple but important 
result describing the asymptotic behavior of relative entropy densities. Although 
the result of this section is only for finite alphabet processes, it is stated and 
proved in a manner that will extend naturally to more general processes later 
on. The result will play a fundamental role in the basic ergodic theorems to 
come. 

Throughout this section we will assume that $M$ and $P$ are two process
distributions describing a random process $\{X_n\}$. Denote as before the sample
vector $X^n = (X_0, X_1, \ldots, X_{n-1})$, that is, the vector beginning at time 0 having
length $n$. The distributions on $X^n$ induced by $M$ and $P$ will be denoted by
$M_n$ and $P_n$, respectively. The corresponding pmf's are $m_{X^n}$ and $p_{X^n}$. The
key assumption in this section is that for all $n$ if $m_{X^n}(x^n) = 0$, then also
$p_{X^n}(x^n) = 0$, that is,

$$M_n \gg P_n \text{ for all } n. \tag{2.30}$$
If this is the case, we can define the relative entropy density

$$h_n(x) = \ln \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \ln f_n(x), \tag{2.31}$$

where

$$f_n(x) = \begin{cases} \dfrac{p_{X^n}(x^n)}{m_{X^n}(x^n)} & \text{if } m_{X^n}(x^n) \ne 0 \\ 0 & \text{otherwise.} \end{cases} \tag{2.32}$$



Observe that the relative entropy is found by integrating the relative entropy
density:

$$H_{P\|M}(X^n) = D(P_n \| M_n) = \sum_{x^n} p_{X^n}(x^n) \ln \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \int \ln \frac{p_{X^n}(X^n)}{m_{X^n}(X^n)} \, dP. \tag{2.33}$$

Thus, for example, if we assume that

$$H_{P\|M}(X^n) < \infty, \text{ all } n, \tag{2.34}$$

then (2.30) holds.

The following lemma will prove to be useful when comparing the asymptotic 
behavior of relative entropy densities for different probability measures. It is the 
first almost everywhere result for relative entropy densities that we consider. It 
is somewhat narrow in the sense that it only compares limiting densities to zero 
and not to expectations. We shall later see that essentially the same argument 
implies the same result for the general case (Theorem 5.4.1), only the interim 
steps involving pmf’s need be dropped. Note that the lemma requires neither 
stationarity nor asymptotic mean stationarity. 

Lemma 2.7.1: Given a finite alphabet process $\{X_n\}$ with process measures
$P, M$ satisfying (2.30), then

$$\limsup_{n \to \infty} \frac{1}{n} h_n \le 0, \quad M\text{-a.e.} \tag{2.35}$$

and

$$\liminf_{n \to \infty} \frac{1}{n} h_n \ge 0, \quad P\text{-a.e.} \tag{2.36}$$

If in addition $M \gg P$, then

$$\lim_{n \to \infty} \frac{1}{n} h_n = 0, \quad P\text{-a.e.}$$



Proof: First consider the probability

$$M\left(\frac{1}{n} h_n \ge \epsilon\right) = M(f_n \ge e^{n\epsilon}) \le \frac{E_M(f_n)}{e^{n\epsilon}}, \tag{2.37}$$

where the final inequality is Markov's inequality. But

$$E_M(f_n) = \sum_{x^n:\, m_{X^n}(x^n) \ne 0} m_{X^n}(x^n) \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \sum_{x^n:\, m_{X^n}(x^n) \ne 0} p_{X^n}(x^n) \le 1$$

and therefore

$$M\left(\frac{1}{n} h_n \ge \epsilon\right) \le e^{-n\epsilon}$$

and hence

$$\sum_{n=1}^{\infty} M\left(\frac{1}{n} h_n \ge \epsilon\right) \le \sum_{n=1}^{\infty} e^{-n\epsilon} < \infty.$$

From the Borel-Cantelli Lemma (e.g., Lemma 4.6.3 of [50]) this implies that
$M(n^{-1} h_n \ge \epsilon \text{ i.o.}) = 0$, which implies the first equation of the lemma.

Next consider

$$P\left(-\frac{1}{n} h_n > \epsilon\right) = \sum_{x^n:\, -\frac{1}{n} \ln p_{X^n}(x^n)/m_{X^n}(x^n) > \epsilon} p_{X^n}(x^n) = \sum_{x^n:\, -\frac{1}{n} \ln p_{X^n}(x^n)/m_{X^n}(x^n) > \epsilon \text{ and } m_{X^n}(x^n) \ne 0} p_{X^n}(x^n)$$

where the last statement follows since if $m_{X^n}(x^n) = 0$, then also $p_{X^n}(x^n) = 0$
and hence nothing would be contributed to the sum. In other words, terms
violating this condition add zero to the sum and hence adding this condition to
the sum does not change the sum's value. Thus

$$\begin{aligned}
P\left(-\frac{1}{n} h_n > \epsilon\right)
 &= \sum_{x^n:\, -\frac{1}{n} \ln p_{X^n}(x^n)/m_{X^n}(x^n) > \epsilon \text{ and } m_{X^n}(x^n) \ne 0} \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} m_{X^n}(x^n) \\
 &= \int_{f_n < e^{-n\epsilon}} dM\, f_n \le \int_{f_n < e^{-n\epsilon}} dM\, e^{-n\epsilon} \\
 &= e^{-n\epsilon} M(f_n < e^{-n\epsilon}) \le e^{-n\epsilon}.
\end{aligned}$$

Thus as before we have that $P(-n^{-1} h_n > \epsilon) \le e^{-n\epsilon}$ and hence that
$P(n^{-1} h_n < -\epsilon \text{ i.o.}) = 0$, which proves the second claim. If also $M \gg P$, then the first
equation of the lemma is also true $P$-a.e., which when coupled with the second
equation proves the third. □
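
The two exponential bounds used in this proof can be verified exhaustively for small $n$. The sketch below assumes two hypothetical i.i.d. binary sources with Bernoulli parameters `p1` (for $P$) and `m1` (for $M$), both strictly between 0 and 1 so that (2.30) holds trivially; the function names are illustrative.

```python
import math
from itertools import product

def pmf_iid(xs, q):
    """Probability of the binary tuple xs under an i.i.d. Bernoulli(q) source."""
    ones = sum(xs)
    return q ** ones * (1 - q) ** (len(xs) - ones)

def tail_probabilities(n, eps, p1, m1):
    """Return (M(h_n/n >= eps), P(h_n/n <= -eps)) by enumerating all n-tuples."""
    m_tail = p_tail = 0.0
    for xs in product((0, 1), repeat=n):
        p, m = pmf_iid(xs, p1), pmf_iid(xs, m1)
        h = math.log(p / m)          # relative entropy density h_n at this tuple
        if h / n >= eps:
            m_tail += m
        if h / n <= -eps:
            p_tail += p
    return m_tail, p_tail

for n in (4, 8, 12):
    m_tail, p_tail = tail_probabilities(n, eps=0.1, p1=0.3, m1=0.6)
    print(n, m_tail, p_tail, math.exp(-n * 0.1))
```

Both tail probabilities stay below $e^{-n\epsilon}$, which is exactly the summability that the Borel-Cantelli argument needs.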




Chapter 3 

The Entropy Ergodic Theorem

3.1 Introduction 

The goal of this chapter is to prove an ergodic theorem for sample entropy of
finite alphabet random processes. The result is sometimes called the ergodic
theorem of information theory or the asymptotic equipartition theorem, but it is
best known as the Shannon-McMillan-Breiman theorem. It provides a common
foundation to many of the results of both ergodic theory and information theory.
Shannon [129] first developed the result for convergence in probability for
stationary ergodic Markov sources. McMillan [103] proved $L^1$ convergence for
stationary ergodic sources and Breiman [19] [20] proved almost everywhere convergence
for stationary and ergodic sources. Billingsley [15] extended the result
to stationary nonergodic sources. Jacobs [67] [66] extended it to processes dominated
by a stationary measure and hence to two-sided AMS processes. Gray
and Kieffer [54] extended it to processes asymptotically dominated by a stationary
measure and hence to all AMS processes. The generalizations to AMS
processes build on the Billingsley theorem for the stationary mean. Following
generalizations of the definitions of entropy and information, corresponding
generalizations of the entropy ergodic theorem will be considered in Chapter 8.

Breiman’s and Billingsley’s approach requires the martingale convergence 
theorem and embeds the possibly one-sided stationary process into a two-sided 
process. Ornstein and Weiss [117] recently developed a proof for the stationary 
and ergodic case that does not require any martingale theory and considers 
only positive time and hence does not require any embedding into two-sided 
processes. The technique was described for both the ordinary ergodic theorem 
and the entropy ergodic theorem by Shields [132]. In addition, it uses a form 
of coding argument that is both more direct and more information theoretic in 
flavor than the traditional martingale proofs. We here follow the Ornstein and 
Weiss approach for the stationary ergodic result. We also use some modifications 
similar to those of Katznelson and Weiss for the proof of the ergodic theorem.
We then generalize the result first to nonergodic processes using the “sandwich” 
technique of Algoet and Cover [7] and then to AMS processes using a variation 
on a result of [54]. 

We next state the theorem to serve as a guide through the various steps. We 
also prove the result for the simple special case of a Markov source, for which 
the result follows from the usual ergodic theorem. 

We consider a directly given finite alphabet source $\{X_n\}$ described by a
distribution $m$ on the sequence measurable space $(\Omega, \mathcal{B})$. Define as previously
$X^n_k = (X_k, X_{k+1}, \ldots, X_{k+n-1})$. The subscript is omitted when it is zero. For
any random variable $Y$ defined on the sequence space (such as $X^n_k$) we define
the random variable $m(Y)$ by $m(Y)(x) = m(Y = Y(x))$.

Theorem 3.1.1: The Entropy Ergodic Theorem

Given a finite alphabet AMS source $\{X_n\}$ with process distribution $m$ and
stationary mean $\bar{m}$, let $\{\bar{m}_x;\ x \in \Omega\}$ be the ergodic decomposition of the
stationary mean $\bar{m}$. Then

$$\lim_{n \to \infty} \frac{-\ln m(X^n)}{n} = h; \quad m\text{-a.e. and in } L^1(m), \tag{3.1}$$

where $h(x)$ is the invariant function defined by

$$h(x) = \bar{H}_{\bar{m}_x}(X). \tag{3.2}$$

Furthermore,

$$E_m h = \lim_{n \to \infty} \frac{1}{n} H_m(X^n) = \bar{H}_m(X); \tag{3.3}$$

that is, the entropy rate of an AMS process is given by the limit, and

$$\bar{H}_{\bar{m}}(X) = \bar{H}_m(X). \tag{3.4}$$

Comments: The theorem states that the sample entropy using the AMS
measure $m$ converges to the entropy rate of the underlying ergodic component
of the stationary mean. Thus, for example, if $m$ is itself stationary and ergodic,
then the sample entropy converges to the entropy rate of the process
$m$-a.e. and in $L^1(m)$. The $L^1(m)$ convergence follows immediately from the
almost everywhere convergence and the fact that sample entropy is uniformly
integrable (Lemma 2.3.6). $L^1$ convergence in turn immediately implies the left-hand
equality of (3.3). Since the limit exists, it is the entropy rate. The final
equality states that the entropy rates of an AMS process and its stationary mean
are the same. This result follows from (3.2)-(3.3) by the following argument:
We have that $\bar{H}_m(X) = E_m h$ and $\bar{H}_{\bar{m}}(X) = E_{\bar{m}} h$, but $h$ is invariant and hence
the two expectations are equal (see, e.g., Lemma 6.3.1 of [50]). Thus we need
only prove almost everywhere convergence in (3.1) to prove the theorem.

In this section we limit ourselves to the following special case of the theo- 
rem that can be proved using the ordinary ergodic theorem without any new 
techniques. 
Lemma 3.1.1: Given a finite alphabet stationary $k$th order Markov source
$\{X_n\}$, there is an invariant function $h$ such that

$$\lim_{n \to \infty} \frac{-\ln m(X^n)}{n} = h; \quad m\text{-a.e. and in } L^1(m),$$

where $h$ is defined by

$$h(x) = -E_{\bar{m}_x} \ln m(X_k | X^k), \tag{3.5}$$

where $\{\bar{m}_x\}$ is the ergodic decomposition of the stationary mean $\bar{m}$. Furthermore,

$$h(x) = \bar{H}_{\bar{m}_x}(X) = H_{\bar{m}_x}(X_k | X^k). \tag{3.6}$$

Proof of Lemma: We have that

$$\frac{1}{n} \ln m(X^n) = \frac{1}{n} \sum_{i=0}^{n-1} \ln m(X_i | X^i).$$

Since the process is $k$th order Markov with stationary transition probabilities,
for $i \ge k$ we have that

$$m(X_i | X^i) = m(X_i | X_{i-k}, \ldots, X_{i-1}) = m(X_k | X^k) T^{i-k}.$$

The terms $-\ln m(X_i | X^i)$, $i = 0, 1, \ldots, k-1$ have finite expectation and hence
are finite $m$-a.e. so that the ergodic theorem can be applied to deduce

$$\begin{aligned}
\frac{-\ln m(X^n)(x)}{n}
 &= -\frac{1}{n} \sum_{i=0}^{k-1} \ln m(X_i | X^i)(x) - \frac{1}{n} \sum_{i=k}^{n-1} \ln m(X_k | X^k)(T^{i-k} x) \\
 &= -\frac{1}{n} \sum_{i=0}^{k-1} \ln m(X_i | X^i)(x) - \frac{1}{n} \sum_{i=0}^{n-k-1} \ln m(X_k | X^k)(T^{i} x) \\
 &\mathop{\longrightarrow}_{n \to \infty} E_{\bar{m}_x}(-\ln m(X_k | X^k)),
\end{aligned}$$

proving the first statement of the lemma. It follows from the ergodic decomposition
of Markov sources (see Lemma 8.6.3 of [50]) that with probability 1,
$\bar{m}_x(X_k | X^k) = m(X_k | \psi(x), X^k) = m(X_k | X^k)$, where $\psi$ is the ergodic component
function. This completes the proof. □
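
A quick simulation makes the statement of the lemma concrete for $k = 1$. The sketch below assumes a hypothetical two-state stationary ergodic Markov chain (the matrix `P` and its stationary distribution `pi` are chosen for the example) and computes the sample entropy $-n^{-1} \ln m(x^n)$ along one simulated path; by the lemma it should be close to $H(X_1 | X_0)$ for large $n$.

```python
import math
import random

def sample_entropy(P, pi, n, rng):
    """Simulate x^n from the chain and return -(1/n) ln m(x^n) in nats."""
    x = 0 if rng.random() < pi[0] else 1
    log_prob = math.log(pi[x])
    for _ in range(n - 1):
        nxt = 0 if rng.random() < P[x][0] else 1
        log_prob += math.log(P[x][nxt])   # chain rule: add ln m(x_i | x_{i-1})
        x = nxt
    return -log_prob / n

P = [[0.9, 0.1], [0.4, 0.6]]   # hypothetical transition matrix
pi = [0.8, 0.2]                # stationary distribution of P
rate = -sum(pi[i] * P[i][j] * math.log(P[i][j]) for i in (0, 1) for j in (0, 1))

rng = random.Random(0)
for n in (100, 10_000, 200_000):
    print(n, sample_entropy(P, pi, n, rng), rate)
```

The fixed seed only makes the run repeatable; the convergence itself holds for almost every sample path.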

We prove the theorem in three steps: The first step considers stationary 
and ergodic sources and uses the approach of Ornstein and Weiss [117] (see also 
Shields [132]). The second step removes the requirement for ergodicity. This 
result will later be seen to provide an information theoretic interpretation of 
the ergodic decomposition. The third step extends the result to AMS processes 
by showing that such processes inherit limiting sample entropies from their 
stationary mean. The later extension of these results to more general relative 
entropy and information densities will closely parallel the proofs of the second 
and third steps for the finite case. 
3.2 Stationary Ergodic Sources

This section is devoted to proving the entropy ergodic theorem for the special
case of stationary ergodic sources. The result was originally proved by Breiman
[19]. The original proof first used the martingale convergence theorem to infer
the convergence of conditional probabilities of the form $m(X_0 | X_{-1}, X_{-2}, \ldots, X_{-k})$
to $m(X_0 | X_{-1}, X_{-2}, \ldots)$. This result was combined with an extended form of
the ergodic theorem stating that if $g_k \to g$ as $k \to \infty$ and if $g_k$ is $L^1$-dominated
($\sup_k |g_k|$ is in $L^1$), then $\frac{1}{n} \sum_{k=0}^{n-1} g_k T^k$ has the same limit as $\frac{1}{n} \sum_{k=0}^{n-1} g T^k$.
Combining these facts yields that

$$\frac{1}{n} \ln m(X^n) = \frac{1}{n} \sum_{k=0}^{n-1} \ln m(X_k | X^k) = \frac{1}{n} \sum_{k=0}^{n-1} \ln m(X_0 | X^k_{-k}) T^k$$

has the same limit as

$$\frac{1}{n} \sum_{k=0}^{n-1} \ln m(X_0 | X_{-1}, X_{-2}, \ldots) T^k$$

which, from the usual ergodic theorem, is the expectation

$$E(\ln m(X_0 | X^-)) = E(\ln m(X_0 | X_{-1}, X_{-2}, \ldots)).$$

As suggested at the end of the preceding chapter, this should be minus the
conditional entropy $H(X_0 | X_{-1}, X_{-2}, \ldots)$ which in turn should be the entropy
rate $\bar{H}_X$. This approach has three shortcomings: it requires a result from martingale
theory which has not been proved here or in the companion volume [50],
it requires an extended ergodic theorem which has similarly not been proved
here, and it requires a more advanced definition of entropy which has not yet
been introduced. Another approach is the sandwich proof of Algoet and Cover
[7]. They show without using martingale theory or the extended ergodic theorem
that $\frac{1}{n} \sum_{i=0}^{n-1} \ln m(X_0 | X^i_{-i}) T^i$ is asymptotically sandwiched between the
entropy rate of a $k$th order Markov approximation:

$$\frac{1}{n} \sum_{i=k}^{n-1} \ln m(X_0 | X^k_{-k}) T^i \mathop{\longrightarrow}_{n \to \infty} E_m[\ln m(X_0 | X^k_{-k})] = -H(X_0 | X^k_{-k})$$

and

$$\frac{1}{n} \sum_{i=k}^{n-1} \ln m(X_0 | X_{-1}, X_{-2}, \ldots) T^i \mathop{\longrightarrow}_{n \to \infty} E_m[\ln m(X_0 | X_{-1}, \ldots)] = -H(X_0 | X_{-1}, X_{-2}, \ldots).$$

By showing that these two limits are arbitrarily close as $k \to \infty$, the result is
proved. The drawback of this approach for present purposes is that again the
more advanced notion of conditional entropy given the infinite past is required.
Algoet and Cover's proof that the above two entropies are asymptotically close
involves martingale theory, but this can be avoided by using Corollary 5.2.4 as
will be seen.

The result can, however, be proved without martingale theory, the extended 
ergodic theorem, or advanced notions of entropy using the approach of Ornstein 
and Weiss [117], which is the approach we shall take in this chapter. In a later 
chapter when the entropy ergodic theorem is generalized to nonfinite alphabets 
and the convergence of entropy and information densities is proved, the sandwich 
approach will be used since the appropriate general definitions of entropy will 
have been developed and the necessary side results will have been proved. 

Lemma 3.2.1: Given a finite alphabet source $\{X_n\}$ with a stationary ergodic
distribution $m$, we have that

$$\lim_{n \to \infty} \frac{-\ln m(X^n)}{n} = h; \quad m\text{-a.e.},$$

where $h(x)$ is the invariant function defined by

$$h(x) = \bar{H}_m(X).$$



Proof: Define

$$h_n(x) = -\ln m(X^n)(x) = -\ln m(x^n)$$

and

$$h(x) = \liminf_{n \to \infty} \frac{1}{n} h_n(x) = \liminf_{n \to \infty} \frac{-\ln m(x^n)}{n}.$$

Since $m((x_0, \ldots, x_{n-1})) \le m((x_1, \ldots, x_{n-1}))$, we have that

$$h_n(x) \ge h_{n-1}(Tx).$$

Dividing by $n$ and taking the limit infimum of both sides shows that $h(x) \ge
h(Tx)$. Since the $n^{-1} h_n$ are nonnegative and uniformly integrable (Lemma
2.3.6), we can use Fatou's lemma to deduce that $h$ and hence also $hT$ are
integrable with respect to $m$. Integrating with respect to the stationary measure
$m$ yields

$$\int dm(x)\, h(x) = \int dm(x)\, h(Tx)$$

which can only be true if

$$h(x) = h(Tx); \quad m\text{-a.e.},$$

that is, if $h$ is an invariant function with $m$-probability one. If $h$ is invariant
almost everywhere, however, it must be a constant with probability one since
$m$ is ergodic (Lemma 6.7.1 of [50]). Since it has a finite integral (bounded by
$\bar{H}_m(X)$), $h$ must also be finite. Henceforth we consider $h$ to be a finite constant.







We now proceed with steps that resemble those of the proof of the ergodic
theorem in Section 7.2 of [50]. Fix $\epsilon > 0$. We also choose for later use a $\delta > 0$
small enough to have the following properties: If $A$ is the alphabet of $X_0$ and
$\|A\|$ is the finite cardinality of the alphabet, then

$$\delta \ln \|A\| < \epsilon, \tag{3.7}$$

and

$$-\delta \ln \delta - (1 - \delta) \ln(1 - \delta) = h_2(\delta) < \epsilon. \tag{3.8}$$

The latter property is possible since $h_2(\delta) \to 0$ as $\delta \to 0$.

Define the random variable $n(x)$ to be the smallest integer $n$ for which
$n^{-1} h_n(x) \le h + \epsilon$. By definition of the limit infimum there must be infinitely
many $n$ for which this is true and hence $n(x)$ is everywhere finite. Define the
set of "bad" sequences by $B = \{x : n(x) > N\}$ where $N$ is chosen so large
that $m(B) < \delta/2$. Still mimicking the proof of the ergodic theorem, we define
a bounded modification of $n(x)$ by

$$\tilde{n}(x) = \begin{cases} n(x) & x \notin B \\ 1 & x \in B \end{cases}$$

so that $\tilde{n}(x) \le N$ for all $x \in B^c$. We now parse the sequence into variable-length
blocks. Iteratively define $n_k(x)$ by

$$\begin{aligned}
n_0(x) &= 0 \\
n_1(x) &= \tilde{n}(x) \\
n_2(x) &= n_1(x) + \tilde{n}(T^{n_1(x)} x) = n_1(x) + l_1(x) \\
&\ \ \vdots \\
n_{k+1}(x) &= n_k(x) + \tilde{n}(T^{n_k(x)} x) = n_k(x) + l_k(x),
\end{aligned}$$

where $l_k(x)$ is the length of the $k$th block:

$$l_k(x) = \tilde{n}(T^{n_k(x)} x).$$



We have parsed a long sequence $x^L = (x_0, \ldots, x_{L-1})$, where $L \gg N$,
into blocks $x_{n_k(x)}, \ldots, x_{n_{k+1}(x)-1} = x^{l_k(x)}_{n_k(x)}$ which begin at time $n_k(x)$ and have
length $l_k(x)$ for $k = 0, 1, \ldots$. We refer to this parsing as the block decomposition
of a sequence. The $k$th block, which begins at time $n_k(x)$, must either have
sample entropy satisfying

$$\frac{-\ln m(x^{l_k(x)}_{n_k(x)})}{l_k(x)} \le h + \epsilon \tag{3.9}$$

or, equivalently, probability at least

$$m(x^{l_k(x)}_{n_k(x)}) \ge e^{-l_k(x)(h + \epsilon)}, \tag{3.10}$$

or it must consist of only a single symbol. Blocks having length 1 ($l_k = 1$)
could have the correct sample entropy, that is,

$$\frac{-\ln m(x^{1}_{n_k(x)})}{1} \le h + \epsilon,$$

or they could be bad in the sense that they are the first symbol of a sequence
with $n > N$; that is,

$$n(T^{n_k(x)} x) > N,$$

or, equivalently,

$$T^{n_k(x)} x \in B.$$

Except for these bad symbols, each of the blocks by construction will have a
probability which satisfies the above bound.
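
The block decomposition is easy to mimic on a finite sequence. The sketch below is an illustration only: it assumes a hypothetical i.i.d. source, so that block probabilities are products of symbol probabilities and the constant $h$ is the single-letter entropy, and it treats a position as bad when no block of length at most $N$ (or none fitting in the remaining data) meets the bound (3.9). The names `parse_blocks` and `entropy` are introduced for the example.

```python
import math
import random

def entropy(pmf):
    """Entropy in nats of a pmf given as a dict symbol -> probability."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def parse_blocks(x, pmf, eps, N):
    """Parse x into (start, length, is_bad) blocks following the rules in the text."""
    h = entropy(pmf)
    blocks, start = [], 0
    while start < len(x):
        neg_log, length, bad = 0.0, 1, True    # default: a single bad symbol
        for n in range(1, N + 1):
            if start + n > len(x):
                break                          # not enough data left to decide
            neg_log -= math.log(pmf[x[start + n - 1]])
            if neg_log / n <= h + eps:         # smallest n meeting (3.9)
                length, bad = n, False
                break
        blocks.append((start, length, bad))
        start += length
    return blocks

pmf = {"a": 0.7, "b": 0.2, "c": 0.1}           # hypothetical source pmf
rng = random.Random(1)
x = rng.choices(list(pmf), weights=list(pmf.values()), k=500)
blocks = parse_blocks(x, pmf, eps=0.1, N=20)
n_bad = sum(1 for _, _, bad in blocks if bad)
print(len(blocks), "blocks,", n_bad, "bad single symbols")
```

Every block that is not flagged bad satisfies the probability bound (3.10) by construction, which is the property the counting argument relies on.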

Define for nonnegative integers $n$ and positive integers $l$ the sets

$$S(n, l) = \{x : m(X^l_n(x)) \ge e^{-l(h + \epsilon)}\},$$

that is, the collection of infinite sequences for which (3.9) and (3.10) hold for
a block starting at $n$ and having length $l$. Observe that for such blocks there
cannot be more than $e^{l(h + \epsilon)}$ distinct $l$-tuples for which the bound holds (lest
the probabilities sum to something greater than 1). In symbols this is

$$\|S(n, l)\| \le e^{l(h + \epsilon)}. \tag{3.11}$$

The ergodic theorem will imply that there cannot be too many single symbol
blocks with $n(T^{n_k(x)} x) > N$ because the event has small probability. These
facts will be essential to the proof.

Even though we write $\tilde{n}(x)$ as a function of the entire infinite sequence, we
can determine its value by observing only the prefix $x^N$ of $x$ since either there
is an $n \le N$ for which $-n^{-1} \ln m(x^n) \le h + \epsilon$ or there is not. Hence there is a
function $\tilde{n}(x^N)$ such that $\tilde{n}(x) = \tilde{n}(x^N)$. Define the finite length sequence event
$C = \{x^N : \tilde{n}(x^N) = 1 \text{ and } -\ln m(x^1) > h + \epsilon\}$, that is, $C$ is the collection of all
$N$-tuples $x^N$ that are prefixes of bad infinite sequences, sequences $x$ for which
$n(x) > N$. Thus in particular,

$$x \in B \text{ if and only if } x^N \in C. \tag{3.12}$$



Now recall that we parse sequences of length $L \gg N$ and define the set $G_L$
of "good" $L$-tuples by

$$G_L = \left\{x^L : \frac{1}{L - N} \sum_{i=0}^{L-N-1} 1_C(x^N_i) \le \delta\right\},$$

that is, $G_L$ is the collection of all $L$-tuples which have no more than $\delta(L - N) \le \delta L$
time slots $i$ for which $x^N_i$ is a prefix of a bad infinite sequence. From (3.12) and
the ergodic theorem for stationary ergodic sources we know that $m$-a.e. we get
an $x$ for which

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} 1_C(x^N_i) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} 1_B(T^i x) = m(B) \le \frac{\delta}{2}. \tag{3.13}$$

From the definition of a limit, this means that with probability 1 we get an $x$
for which there is an $L_0 = L_0(x)$ such that

$$\frac{1}{L - N} \sum_{i=0}^{L-N-1} 1_C(x^N_i) \le \delta; \quad \text{for all } L > L_0. \tag{3.14}$$

This follows simply because if the limit is less than $\delta/2$, there must be an $L_0$ so
large that for larger $L$ the time average is no greater than $2\delta/2 = \delta$. We
can restate (3.14) as follows: with probability 1 we get an $x$ for which $x^L \in G_L$
for all but a finite number of $L$. Stating this in negative fashion, we have one of
the key properties required by the proof: If $x^L \in G_L$ for all but a finite number
of $L$, then $x^L$ cannot be in the complement $G^c_L$ infinitely often, that is,

$$m(x : x^L \in G^c_L \text{ i.o.}) = 0. \tag{3.15}$$

We now change tack to develop another key result for the proof. For each
$L$ we bound above the cardinality $\|G_L\|$ of the set of good $L$-tuples. By
construction there are no more than $\delta L$ bad symbols in an $L$-tuple in $G_L$ and
these can occur in any of at most

$$\sum_{k \le \delta L} \binom{L}{k} \le e^{h_2(\delta) L} \tag{3.16}$$

places, where we have used Lemma 2.3.5. Eq. (3.16) provides an upper
bound on the number of ways that a sequence in $G_L$ can be parsed by the given
rules. The bad symbols and the final $N$ symbols in the $L$-tuple can take on
any of the $\|A\|$ different values in the alphabet. Eq. (3.11) bounds the number
of finite length sequences that can occur in each of the remaining blocks and
hence for any given block decomposition, the number of ways that the remaining
blocks can be filled is bounded above by

$$\prod_{k:\, T^{n_k(x)} x \notin B} e^{l_k(x)(h + \epsilon)} = e^{\sum_k l_k(x)(h + \epsilon)} \le e^{L(h + \epsilon)}, \tag{3.17}$$

regardless of the details of the parsing. Combining these bounds we have that

$$\|G_L\| \le e^{h_2(\delta) L} \times \|A\|^{\delta L} \times \|A\|^{N} \times e^{L(h + \epsilon)} = e^{h_2(\delta) L + (\delta L + N) \ln \|A\| + L(h + \epsilon)}$$

or

$$\|G_L\| \le e^{L\left(h + \epsilon + h_2(\delta) + (\delta + \frac{N}{L}) \ln \|A\|\right)}.$$
Since $\delta$ satisfies (3.7)-(3.8), we can choose $L_1$ large enough so that $N \ln \|A\| / L_1 \le
\epsilon$ and thereby obtain

$$\|G_L\| \le e^{L(h + 4\epsilon)}; \quad L \ge L_1. \tag{3.18}$$

This bound provides the second key result in the proof of the lemma. We now 
combine (3.18) and (3.15) to complete the proof. 
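
The combinatorial bound (3.16), which counts the possible placements of at most $\delta L$ bad symbols among $L$ positions, can be checked numerically. The sketch below assumes $0 < \delta \le 1/2$, the range in which this standard binomial-sum bound holds; the function names are illustrative.

```python
import math

def h2(delta):
    """Binary entropy function in nats, for 0 < delta < 1."""
    return -delta * math.log(delta) - (1 - delta) * math.log(1 - delta)

def placements(L, delta):
    """Number of ways to choose at most delta * L positions out of L."""
    return sum(math.comb(L, k) for k in range(math.floor(delta * L) + 1))

for L in (10, 50, 200):
    for delta in (0.05, 0.1, 0.25):
        count = placements(L, delta)
        bound = math.exp(L * h2(delta))
        print(L, delta, count, bound, count <= bound)
```

Since $h_2(\delta) \to 0$ as $\delta \to 0$, the number of admissible parsings grows at an exponential rate that can be made as small as desired, which is why it costs only one extra $\epsilon$ in the exponent of (3.18).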

Let $B_L$ denote a collection of $L$-tuples that are bad in the sense of having
too large a sample entropy or, equivalently, too small a probability; that is if
$x^L \in B_L$, then

$$m(x^L) \le e^{-L(h + 5\epsilon)}$$

or, equivalently, for any $x$ with prefix $x^L$

$$\frac{1}{L} h_L(x) \ge h + 5\epsilon.$$

The upper bound on $\|G_L\|$ provides a bound on the probability of $B_L \cap G_L$:

$$m(B_L \cap G_L) = \sum_{x^L \in B_L \cap G_L} m(x^L) \le \sum_{x^L \in G_L} e^{-L(h + 5\epsilon)} \le \|G_L\| e^{-L(h + 5\epsilon)} \le e^{-\epsilon L}.$$

Recall now that the above bound is true for a fixed $\epsilon > 0$ and for all $L \ge L_1$.
Thus

$$\sum_{L=1}^{\infty} m(B_L \cap G_L) = \sum_{L=1}^{L_1 - 1} m(B_L \cap G_L) + \sum_{L=L_1}^{\infty} m(B_L \cap G_L)$$
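$$\le L_1 + \sum_{L=L_1}^{\infty} e^{-\epsilon L} < \infty$$

and hence from the Borel-Cantelli lemma $m(x : x^L \in B_L \cap G_L \text{ i.o.}) = 0$.
Together with (3.15) this shows that with probability 1, $x^L \in B_L$ for at most a
finite number of $L$, so that $\limsup_{L \to \infty} L^{-1} h_L(x) \le h + 5\epsilon$; $m$-a.e. Since $\epsilon$ is
arbitrary, the limit supremum of the sample entropy is no larger than its limit
infimum $h$, so the sample entropy converges to the constant $h$ with probability 1.
Because the sample entropy is uniformly integrable (Lemma 2.3.6), the convergence
also holds in $L^1(m)$ and therefore $h = \lim_{n \to \infty} n^{-1} H_m(X^n) = \bar{H}_m(X)$,
which completes the proof of the lemma and hence also proves Theorem 3.1.1
for the special case of stationary ergodic measures. □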



3.3 Stationary Nonergodic Sources 

Next suppose that a source is stationary with ergodic decomposition $\{m_\lambda;\ \lambda \in
\Lambda\}$ and ergodic component function $\psi$ as in Theorem 1.8.3. The source will
produce with probability one under $m$ an ergodic component $m_\lambda$ and Lemma
3.2.1 will hold for this ergodic component. In other words, we should have that

$$\lim_{n \to \infty} -\frac{1}{n} \ln m_\psi(X^n) = \bar{H}_{m_\psi}(X); \quad m\text{-a.e.}, \tag{3.20}$$

that is,

$$m\left(\left\{x : -\lim_{n \to \infty} \frac{1}{n} \ln m_{\psi(x)}(x^n) = \bar{H}_{m_{\psi(x)}}(X)\right\}\right) = 1.$$

This argument is made rigorous in the following lemma. 

Lemma 3.3.1: Suppose that $\{X_n\}$ is a stationary not necessarily ergodic
source with ergodic component function $\psi$. Then

$$m\left(\left\{x : -\lim_{n \to \infty} \frac{1}{n} \ln m_{\psi(x)}(x^n) = \bar{H}_{m_{\psi(x)}}(X)\right\}\right) = 1. \tag{3.21}$$

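Proof: Let

$$G = \left\{x : -\lim_{n \to \infty} \frac{1}{n} \ln m_{\psi(x)}(x^n) = \bar{H}_{m_{\psi(x)}}(X)\right\}$$

and let $G_\lambda$ denote the section of $G$ at $\lambda$, that is,

$$G_\lambda = \left\{x : -\lim_{n \to \infty} \frac{1}{n} \ln m_\lambda(x^n) = \bar{H}_{m_\lambda}(X)\right\}.$$

From the ergodic decomposition (e.g., Theorem 1.8.3 or [50], Theorem 8.5.1)
and (1.26)

$$m(G) = \int dP_\psi(\lambda)\, m_\lambda(G),$$

where

$$m_\lambda(G) = m(G | \psi = \lambda) = m(G \cap \{x : \psi(x) = \lambda\} | \psi = \lambda) = m(G_\lambda | \psi = \lambda) = m_\lambda(G_\lambda)$$

which is 1 for all $\lambda$ from the stationary ergodic result. Thus

$$m(G) = \int dP_\psi(\lambda)\, m_\lambda(G_\lambda) = 1.$$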

It is straightforward to verify that all of the sets considered are in fact measur- 
able. □ 

Unfortunately it is not the sample entropy using the distribution of the
ergodic component that is of interest; rather, it is the original sample entropy
for which we wish to prove convergence. The following lemma shows that the
two sample entropies converge to the same limit and hence Lemma 3.3.1 will also
provide the limit of the sample entropy with respect to the stationary measure.

Lemma 3.3.2: Given a stationary source $\{X_n\}$, let $\{m_\lambda;\ \lambda \in \Lambda\}$ denote
the ergodic decomposition and $\psi$ the ergodic component function of Theorem
1.8.3. Then

$$\lim_{n \to \infty} \frac{1}{n} \ln \frac{m_\psi(X^n)}{m(X^n)} = 0; \quad m\text{-a.e.}$$

Proof: First observe that if $m(a^n)$ is 0, then from the ergodic decomposition
with probability 1 $m_\psi(a^n)$ will also be 0. One part is easy. For any $\epsilon > 0$ we
have from the Markov inequality that

$$m\left(\frac{1}{n} \ln \frac{m(X^n)}{m_\psi(X^n)} > \epsilon\right) = m\left(\frac{m(X^n)}{m_\psi(X^n)} > e^{n\epsilon}\right) \le E_m\left(\frac{m(X^n)}{m_\psi(X^n)}\right) e^{-n\epsilon}.$$

The expectation, however, can be evaluated as follows: Let $A^{(\lambda)}_n = \{a^n :
m_\lambda(a^n) > 0\}$. Then

$$E_m\left(\frac{m(X^n)}{m_\psi(X^n)}\right) = \int dP_\psi(\lambda) \sum_{a^n \in A^{(\lambda)}_n} \frac{m(a^n)}{m_\lambda(a^n)} m_\lambda(a^n) = \int dP_\psi(\lambda)\, m(A^{(\lambda)}_n) \le 1,$$

where $P_\psi$ is the distribution of $\psi$. Thus

$$m\left(\frac{1}{n} \ln \frac{m(X^n)}{m_\psi(X^n)} > \epsilon\right) \le e^{-n\epsilon}$$

and hence

$$\sum_{n=1}^{\infty} m\left(\frac{1}{n} \ln \frac{m(X^n)}{m_\psi(X^n)} > \epsilon\right) < \infty$$

and hence from the Borel-Cantelli lemma

$$m\left(\frac{1}{n} \ln \frac{m(X^n)}{m_\psi(X^n)} > \epsilon \text{ i.o.}\right) = 0$$

and hence with $m$ probability 1

$$\limsup_{n \to \infty} \frac{1}{n} \ln \frac{m(X^n)}{m_\psi(X^n)} \le \epsilon.$$

Since $\epsilon$ is arbitrary,

$$\limsup_{n \to \infty} \frac{1}{n} \ln \frac{m(X^n)}{m_\psi(X^n)} \le 0; \quad m\text{-a.e.} \tag{3.22}$$

For later use we restate this as

$$\liminf_{n \to \infty} \frac{1}{n} \ln \frac{m_\psi(X^n)}{m(X^n)} \ge 0; \quad m\text{-a.e.} \tag{3.23}$$



We now turn to the converse inequality. For any positive integer $k$, we can
construct a stationary $k$-step Markov approximation to $m$ as in Section 2.6, that
is, construct a process $m^{(k)}$ with the conditional probabilities

$$m^{(k)}(X_n \in F | X^n) = m^{(k)}(X_n \in F | X^k_{n-k}) = m(X_n \in F | X^k_{n-k})$$

and the same $k$th order distributions $m^{(k)}(X^k \in F) = m(X^k \in F)$. Consider
the probability

$$m\left(\frac{1}{n} \ln \frac{m^{(k)}(X^n)}{m(X^n)} \ge \epsilon\right) = m\left(\frac{m^{(k)}(X^n)}{m(X^n)} \ge e^{n\epsilon}\right) \le E_m\left(\frac{m^{(k)}(X^n)}{m(X^n)}\right) e^{-n\epsilon}.$$

The expectation is evaluated as

$$\sum_{x^n:\, m(x^n) \ne 0} \frac{m^{(k)}(x^n)}{m(x^n)} m(x^n) \le 1$$

and hence we again have using Borel-Cantelli that

$$\limsup_{n \to \infty} \frac{1}{n} \ln \frac{m^{(k)}(X^n)}{m(X^n)} \le 0.$$

We can apply the usual ergodic theorem to conclude that with probability 1
under $m$

$$\limsup_{n \to \infty} \frac{1}{n} \ln \frac{1}{m(X^n)} \le \lim_{n \to \infty} \frac{1}{n} \ln \frac{1}{m^{(k)}(X^n)} = E_{m_\psi}[-\ln m(X_k | X^k)].$$

Combining this result with (3.20) we have using Lemma 2.4.3 that

$$\limsup_{n \to \infty} \frac{1}{n} \ln \frac{m_\psi(X^n)}{m(X^n)} \le -\bar{H}_{m_\psi}(X) - E_{m_\psi}[\ln m(X_k | X^k)] = \bar{H}_{m_\psi \| m^{(k)}}(X). \tag{3.24}$$

This bound holds for any integer $k$ and hence it must also be true that $m$-a.e.
the following holds:

$$\limsup_{n \to \infty} \frac{1}{n} \ln \frac{m_\psi(X^n)}{m(X^n)} \le \inf_k \bar{H}_{m_\psi \| m^{(k)}}(X) \equiv \zeta. \tag{3.25}$$
In order to evaluate $\zeta$ we apply the ergodic decomposition of relative entropy
rate (Corollary 2.4.2) and the ordinary ergodic decomposition to write

$$\int dP_\psi\, \zeta = \int dP_\psi \inf_k \bar{H}_{m_\psi \| m^{(k)}}(X) \le \inf_k \int dP_\psi\, \bar{H}_{m_\psi \| m^{(k)}}(X) = \inf_k \bar{H}_{m \| m^{(k)}}(X).$$

From Theorem 2.6.2, the right hand term is 0. If the integral of a nonnegative
function is 0, the integrand must itself be 0 with probability one. Thus (3.25)
becomes

$$\limsup_{n \to \infty} \frac{1}{n} \ln \frac{m_\psi(X^n)}{m(X^n)} \le 0,$$

which with (3.23) completes the proof of the lemma. □



We shall later see that the quantity

$$i_n(X^n; \psi) = \frac{1}{n} \ln \frac{m_\psi(X^n)}{m(X^n)}$$

is the sample mutual information (in a generalized sense so that it applies to the
usually non-discrete $\psi$) and hence the lemma states that the normalized sample
mutual information between the process outputs and the ergodic component
function goes to 0 as the number of samples goes to infinity.
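
A simple stationary nonergodic example shows the size of this quantity. The sketch below assumes a hypothetical source that first picks one of two i.i.d. binary sources with equal probability and then emits from it forever, so that the ergodic components are the two i.i.d. sources and $m$ is their equal-weight mixture. Because $m(x^n) \ge \frac{1}{2} m_\psi(x^n)$ for such a mixture, the normalized information can never exceed $(\ln 2)/n$; the function names and Bernoulli parameters are assumptions for the illustration.

```python
import math
import random

def log_prob_iid(xs, q):
    """ln probability of a binary sequence under an i.i.d. Bernoulli(q) source."""
    ones = sum(xs)
    return ones * math.log(q) + (len(xs) - ones) * math.log(1 - q)

def normalized_information(xs, q_component, q_other):
    """(1/n) ln [m_psi(x^n) / m(x^n)] where m is the equal-weight mixture."""
    a = log_prob_iid(xs, q_component)
    b = log_prob_iid(xs, q_other)
    top = max(a, b)                    # log-sum-exp to avoid underflow
    log_m = top + math.log(0.5 * math.exp(a - top) + 0.5 * math.exp(b - top))
    return (a - log_m) / len(xs)

rng = random.Random(0)
for n in (10, 100, 5000):
    xs = [1 if rng.random() < 0.2 else 0 for _ in range(n)]   # component with q = 0.2
    print(n, normalized_information(xs, 0.2, 0.7), math.log(2) / n)
```

For long typical sequences of the selected component the other component contributes a vanishing share of $m(x^n)$, so the printed value approaches the ceiling $(\ln 2)/n$ and hence tends to 0.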

The two previous lemmas immediately yield the following result. 

Corollary 3.3.1: The conclusions of Theorem 3.1.1 hold for sources that 
are stationary. 



3.4 AMS Sources 



The principal idea required to extend the entropy theorem from stationary 
sources to AMS sources is contained in Lemma 3.4.2. It shows that an AMS 
source inherits sample entropy properties from an asymptotically dominating 
stationary source (just as it inherits ordinary ergodic properties from such a 
source). The result is originally due to Gray and Kieffer [54], but the proof 
here is somewhat different. The tough part here is handling the fact that the 
sample average being considered depends on a specific measure. From Theorem 
1.7.1, the stationary mean of an AMS source dominates the original source on 
tail events, that is, events in $\mathcal{F}_\infty$. We begin by showing that certain important
events can be recast as tail events, that is, they can be determined by looking 
at only samples in the arbitrarily distant future. The following result is of this 
variety: It implies that sample entropy is unaffected by the starting time. 

Lemma 3.4.1: Let $\{X_n\}$ be a finite alphabet source with distribution m.
Recall that $X_k^n = (X_k, X_{k+1}, \ldots, X_{k+n-1})$ and define the information density

$$i(X^k; X_k^{n-k}) = \ln \frac{m(X^n)}{m(X^k)\, m(X_k^{n-k})}.$$

Then

$$\lim_{n\to\infty} \frac{1}{n}\, i(X^k; X_k^{n-k}) = 0; \quad m\text{-a.e.}$$



Comment: The lemma states that with probability 1 the per-sample mutual
information density between the first k samples and future samples goes to zero
in the limit. Equivalently, limits of $n^{-1} \ln m(X^n)$ will be the same as limits of
$n^{-1} \ln m(X_k^{n-k})$ for any finite k. Note that the result does not require even that
the source be AMS. The lemma is a direct consequence of Lemma 2.7.1.

Proof: Define the distribution $p = m_{X^k} \times m_{X_k, X_{k+1}, \ldots}$, that is, a distribution
for which all samples after the first k are independent of the first k samples.
Thus, in particular, $p(X^n) = m(X^k)\, m(X_k^{n-k})$. We will show that $p \gg m$, in
which case the lemma will follow from Lemma 2.7.1. Suppose that $p(F) = 0$. If
we denote $X_k^+ = X_k, X_{k+1}, \ldots$, then

$$0 = p(F) = \sum_{x^k} m(x^k)\, m_{X_k^+}(F_{x^k}),$$

where $F_{x^k}$ is the section $\{x_k^+ : (x^k, x_k^+) = x \in F\}$. For the above relation to
hold, we must have $m_{X_k^+}(F_{x^k}) = 0$ for all $x^k$ with $m(x^k) \ne 0$. We also have,
however, that

$$m(F) = \sum_{a^k} m(X^k = a^k, X_k^+ \in F_{a^k}) = \sum_{a^k} m(X^k = a^k \mid X_k^+ \in F_{a^k})\, m(X_k^+ \in F_{a^k}).$$

But this sum must be 0 since the rightmost terms are 0 for all $a^k$ for which
$m(X^k = a^k)$ is not 0. (Observe that we must have $m(X^k = a^k \mid X_k^+ \in F_{a^k}) = 0$
if $m(X_k^+ \in F_{a^k}) \ne 0$ since otherwise $m(X^k = a^k) \ge m(X^k = a^k, X_k^+ \in F_{a^k}) > 0$,
yielding a contradiction.) Thus $p \gg m$ and the lemma is proved. □

For later use we note that we have shown that a joint distribution is dom- 
inated by a product of its marginals if one of the marginal distributions is 
discrete. 
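
As an illustration of Lemma 3.4.1, the following sketch evaluates the information density for a two-state Markov source. The transition matrix, the initial distribution, the sample path, and all function names are assumptions made only for this example. For a Markov source the density collapses to the single term $\ln[P(x_{k-1}, x_k)/m(X_k = x_k)]$, which is bounded, so dividing by n drives it to zero.

```python
import math

def block_log_prob(block, start_dist, trans):
    """Natural-log probability of a block for a Markov chain whose first
    symbol has distribution start_dist and whose transitions are trans."""
    logp = math.log(start_dist[block[0]])
    for prev, cur in zip(block, block[1:]):
        logp += math.log(trans[prev][cur])
    return logp

def marginal_at(k, init, trans):
    """Distribution of X_k, obtained by applying the transition matrix k times."""
    dist = list(init)
    for _ in range(k):
        dist = [sum(dist[a] * trans[a][b] for a in range(len(dist)))
                for b in range(len(dist))]
    return dist

def info_density(x, k, init, trans):
    """i(X^k; X_k^{n-k}) = ln m(x^n) - ln m(x^k) - ln m(x_k^{n-k})."""
    head, tail = x[:k], x[k:]
    return (block_log_prob(x, init, trans)
            - block_log_prob(head, init, trans)
            - block_log_prob(tail, marginal_at(k, init, trans), trans))

# Hypothetical source and an arbitrary fixed sample path of length n = 300.
init = [0.9, 0.1]
trans = [[0.7, 0.3], [0.2, 0.8]]
x = [0, 1, 1, 0, 0, 1] * 50
k = 3
i_val = info_density(x, k, init, trans)
print(i_val, i_val / len(x))
```

The unnormalized density does not depend on n here, which is why the normalized version vanishes for every fixed k.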

Lemma 3.4.2: Suppose that $\{X_n\}$ is an AMS source with distribution m
and suppose that $\bar{m}$ is a stationary source that asymptotically dominates m
(e.g., $\bar{m}$ is the stationary mean). If there is an invariant function h such that

$$\lim_{n\to\infty} -\frac{1}{n} \ln \bar{m}(X^n) = h; \quad \bar{m}\text{-a.e.},$$

then also,

$$\lim_{n\to\infty} -\frac{1}{n} \ln m(X^n) = h; \quad m\text{-a.e.}$$



Proof: For any k we can write using the chain rule for densities

$$-\frac{1}{n} \ln m(X^n) + \frac{1}{n} \ln m(X_k^{n-k}) = -\frac{1}{n} \ln m(X^k \mid X_k^{n-k}) = -\frac{1}{n}\, i(X^k; X_k^{n-k}) - \frac{1}{n} \ln m(X^k).$$

From the previous lemma and from the fact that $H_m(X^k) = -E_m \ln m(X^k)$ is
finite, the right hand terms converge to 0 as $n \to \infty$ and hence for any k

$$\lim_{n\to\infty} -\frac{1}{n} \ln m(X^k \mid X_k^{n-k}) = \lim_{n\to\infty} \left( -\frac{1}{n} \ln m(X^n) + \frac{1}{n} \ln m(X_k^{n-k}) \right) = 0; \quad m\text{-a.e.} \tag{3.26}$$

This implies that there is a subsequence $k(n) \to \infty$ such that

$$-\frac{1}{n} \ln m(X^{k(n)} \mid X_{k(n)}^{n-k(n)}) = -\frac{1}{n} \ln m(X^n) + \frac{1}{n} \ln m(X_{k(n)}^{n-k(n)}) \to 0; \quad m\text{-a.e.} \tag{3.27}$$

To see this, observe that (3.26) ensures that for each k there is an N(k) large
enough so that $N(k) > N(k-1)$ and

$$m\left( \left| -\frac{1}{N(k)} \ln m(X^k \mid X_k^{N(k)-k}) \right| > 2^{-k} \right) \le 2^{-k}. \tag{3.28}$$

Applying the Borel-Cantelli lemma implies that for any $\epsilon$,

$$m\left( \left| -\frac{1}{N(k)} \ln m(X^k \mid X_k^{N(k)-k}) \right| > \epsilon \ \text{i.o.} \right) = 0.$$

Now let $k(n) = k$ for $N(k) \le n < N(k+1)$. Then

$$m\left( \left| -\frac{1}{n} \ln m(X^{k(n)} \mid X_{k(n)}^{n-k(n)}) \right| > \epsilon \ \text{i.o.} \right) = 0$$

and therefore

$$\lim_{n\to\infty} \left( -\frac{1}{n} \ln m(X^n) + \frac{1}{n} \ln m(X_{k(n)}^{n-k(n)}) \right) = 0; \quad m\text{-a.e.}$$

as claimed in (3.27).

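In a similar manner we can also choose the sequence so that

$$\lim_{n\to\infty} \left( -\frac{1}{n} \ln \bar{m}(X^n) + \frac{1}{n} \ln \bar{m}(X_{k(n)}^{n-k(n)}) \right) = 0; \quad \bar{m}\text{-a.e.},$$

that is, we can choose N(k) so that (3.28) simultaneously holds for both m and
$\bar{m}$. Invoking the entropy ergodic theorem for the stationary $\bar{m}$ (Corollary 3.3.1)
we have therefore that

$$\lim_{n\to\infty} -\frac{1}{n} \ln \bar{m}(X_{k(n)}^{n-k(n)}) = h; \quad \bar{m}\text{-a.e.} \tag{3.29}$$

From Markov's inequality (Lemma 4.4.3 of [50])

$$\bar{m}\left( -\frac{1}{n} \ln m(X_k^{n-k}) \le -\frac{1}{n} \ln \bar{m}(X_k^{n-k}) - \epsilon \right) = \bar{m}\left( \frac{m(X_k^{n-k})}{\bar{m}(X_k^{n-k})} \ge e^{n\epsilon} \right)$$

$$\le e^{-n\epsilon} E_{\bar{m}}\left( \frac{m(X_k^{n-k})}{\bar{m}(X_k^{n-k})} \right) = e^{-n\epsilon} \sum_{x_k^{n-k}:\, \bar{m}(x_k^{n-k}) \ne 0} m(x_k^{n-k}) \le e^{-n\epsilon}.$$

Hence taking $k = k(n)$ and again invoking the Borel-Cantelli lemma we have
that

$$\bar{m}\left( -\frac{1}{n} \ln m(X_{k(n)}^{n-k(n)}) \le -\frac{1}{n} \ln \bar{m}(X_{k(n)}^{n-k(n)}) - \epsilon \ \text{i.o.} \right) = 0$$

or, equivalently, that

$$\liminf_{n\to\infty} \frac{1}{n} \ln \frac{\bar{m}(X_{k(n)}^{n-k(n)})}{m(X_{k(n)}^{n-k(n)})} \ge 0; \quad \bar{m}\text{-a.e.} \tag{3.30}$$

Therefore from (3.29)

$$\liminf_{n\to\infty} -\frac{1}{n} \ln m(X_{k(n)}^{n-k(n)}) \ge h; \quad \bar{m}\text{-a.e.} \tag{3.31}$$

The above event is in the tail $\sigma$-field $\mathcal{F}_\infty = \bigcap_n \sigma(X_n, X_{n+1}, \ldots)$ since it can be
determined from $X_{k(n)}, \ldots$ for arbitrarily large n and since h is invariant. Since
$\bar{m}$ dominates m on the tail $\sigma$-field (Theorem 1.7.2), we have also

$$\liminf_{n\to\infty} -\frac{1}{n} \ln m(X_{k(n)}^{n-k(n)}) \ge h; \quad m\text{-a.e.}$$

and hence by (3.27)

$$\liminf_{n\to\infty} -\frac{1}{n} \ln m(X^n) \ge h; \quad m\text{-a.e.},$$

which proves half of the lemma.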

Since

$$\bar{m}\left( \lim_{n\to\infty} -\frac{1}{n} \ln \bar{m}(X^n) \ne h \right) = 0$$

and since $\bar{m}$ asymptotically dominates m (Theorem 1.7.1), given $\epsilon > 0$ there is
a k such that

$$m\left( \lim_{n\to\infty} -\frac{1}{n} \ln \bar{m}(X_k^n) = h \right) \ge 1 - \epsilon.$$

Again applying Markov's inequality and the Borel-Cantelli lemma as in the
development of (3.29) we have that

$$\liminf_{n\to\infty} -\frac{1}{n} \ln \frac{\bar{m}(X_k^n)}{m(X_k^n)} \ge 0; \quad m\text{-a.e.},$$

which implies that

$$m\left( \limsup_{n\to\infty} -\frac{1}{n} \ln m(X_k^n) \le h \right) \ge 1 - \epsilon$$

and hence also that

$$m\left( \limsup_{n\to\infty} -\frac{1}{n} \ln m(X^n) \le h \right) \ge 1 - \epsilon.$$

Since $\epsilon$ can be made arbitrarily small, this proves that m-a.e.

$$\limsup_{n\to\infty} -n^{-1} \ln m(X^n) \le h,$$

which completes the proof of the lemma. □

The lemma combined with Corollary 3.3.1 completes the proof of Theorem 
3.1.1. □ 



3.5 The Asymptotic Equipartition Property 

Since convergence almost everywhere implies convergence in probability, Theorem
3.1.2 has the following implication: Suppose that $\{X_n\}$ is an AMS ergodic
source with entropy rate $H$. Given $\epsilon > 0$ there is an N such that for all $n > N$
the set

$$G_n = \{x^n : |n^{-1} h_n(x) - H| \le \epsilon\} = \{x^n : e^{-n(H+\epsilon)} \le m(x^n) \le e^{-n(H-\epsilon)}\}$$

has probability greater than $1 - \epsilon$. Furthermore, as in the proof of the theorem,
there can be no more than $e^{n(H+\epsilon)}$ n-tuples in $G_n$. Thus there are two sets of
n-tuples: a “good” set of approximately $e^{nH}$ n-tuples having approximately equal
probability of $e^{-nH}$ and the complement of this set which has small total
probability. The good sequences are often referred to as “typical sequences”
in the information theory literature and in this form the theorem is called the
asymptotic equipartition property or the AEP.

As a first information theoretic application of an ergodic theorem, we con- 
sider a simple coding scheme called an “almost noiseless source code.” As we 
often do, we consider logarithms to the base 2 when considering specific coding 
applications. Suppose that a random process $\{X_n\}$ has a finite alphabet A with
cardinality ||A|| and entropy rate H. Suppose that H < log ||A||, e.g., A might 
have 16 symbols, but the entropy rate is slightly less than 2 bits per symbol 
rather than log 16 = 4. Larger alphabets cost money in either storage or com- 
munication applications. For example, to communicate a source with a 16 letter 
alphabet sending one letter per second without using any coding and using a 
binary communication system we would need to send 4 binary symbols (or four 
bits) for each source letter and hence 4 bits per second would be required. If 
the alphabet only had 4 letters, we would need to send only 2 bits per second. 
The question is the following: Since our source has an alphabet of size 16 but
an entropy rate of less than 2, can we code the original source into a new source
with an alphabet of only 4 letters so as to communicate the source at the smaller 
rate and yet have the receiver be able to recover the original source? The AEP 
suggests a technique for accomplishing this provided we are willing to tolerate 
occasional errors. 

We construct a code of the original source by first picking a small $\epsilon$ and
a $\delta$ small enough so that $H + \delta < 2$. Choose a large enough n so that the
AEP holds giving a set $G_n$ of good sequences as above with probability greater
than $1 - \epsilon$. Index this collection of fewer than $2^{n(H+\delta)} < 2^{2n}$ sequences using
binary 2n-tuples. The source $X_k$ is parsed into blocks of length n as
$X_{kn}^n = (X_{kn}, X_{kn+1}, \ldots, X_{(k+1)n-1})$ and each block is encoded into a binary
2n-tuple as follows: If the source n-tuple is in $G_n$, the codeword is its binary
2n-tuple index. Select one of the unused binary 2n-tuples as the error index and
whenever an n-tuple is not in $G_n$, the error index is the codeword. The receiver
or decoder then uses the received index and decodes it as the appropriate n-tuple
in $G_n$. If the error index is received, the decoder can declare an arbitrary source
sequence or just declare an error. With probability at least $1 - \epsilon$ a source n-tuple
at a particular time will be in $G_n$ and hence it will be correctly decoded. We
can make the probability of a decoding error as small as desired by taking n
large enough, but we cannot in general make it 0.
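
The counting behind this scheme can be checked numerically. The sketch below is an illustration only: it assumes a binary i.i.d. source with Pr(1) = 0.1 (not the 16-letter source of the text), measures information in bits, and uses hypothetical names throughout. It groups n-tuples by their number of ones, so the good set is counted without enumerating all $2^n$ sequences.

```python
import math

def typical_set_stats(p, n, eps):
    """For an i.i.d. Bernoulli(p) source return (H, count, prob), where H is the
    entropy in bits, count is the number of n-tuples x^n satisfying
    | -(1/n) log2 m(x^n) - H | <= eps, and prob is their total probability."""
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    count, prob = 0, 0.0
    for ones in range(n + 1):
        # Every n-tuple with the same number of ones has the same probability.
        log_prob = ones * math.log2(p) + (n - ones) * math.log2(1 - p)
        if abs(-log_prob / n - H) <= eps:
            multiplicity = math.comb(n, ones)
            count += multiplicity
            prob += multiplicity * 2.0 ** log_prob
    return H, count, prob

n, eps = 200, 0.1
H, count, prob = typical_set_stats(0.1, n, eps)
# Bits needed to index the good set plus one extra error index.
bits = math.ceil(math.log2(count + 1))
print(H, prob, bits, n)
```

Under these assumptions the good set carries most of the probability while its index needs substantially fewer than n binary symbols, which is the saving the block code exploits.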

The above simple scheme is an example of a block coding scheme. If con- 
sidered as a mapping from sequences into sequences, the map is not stationary, 
but it is block stationary in the sense that shifting an input block by n results 
in a corresponding block shift of the encoded sequence by $2n$ binary symbols.




Chapter 4 



Information Rates I 



4.1 Introduction 

Before proceeding to generalizations of the various measures of information, 
entropy, and divergence to nondiscrete alphabets, we consider several properties 
of information and entropy rates of finite alphabet processes. We show that 
codes that produce similar outputs with high probability yield similar rates and 
that entropy and information rate, like ordinary entropy and information, are 
reduced by coding. The discussion introduces a basic tool of ergodic theory,
the partition distance, and develops several versions of an early and fundamental
result from information theory, Fano's inequality. We obtain an ergodic theorem
for information densities of finite alphabet processes as a simple application of
the general Shannon-McMillan-Breiman theorem coupled with some definitions.
In Chapter 6 these results easily provide $L^1$ ergodic theorems for information
densities for more general processes.

4.2 Stationary Codes and Approximation 

We consider the behavior of entropy when codes or measurements are taken on 
the underlying random variables. We have seen that entropy is a continuous 
function with respect to the underlying measure. We now wish to fix the measure 
and show that entropy is a continuous function with respect to the underlying 
measurement. 

Say we have two finite alphabet measurements $f$ and $g$ on a common
probability space having a common alphabet A. Suppose that $\mathcal{Q}$ and $\mathcal{R}$ are the
corresponding partitions. A common metric or distance measure on partitions
in ergodic theory is

$$|\mathcal{Q} - \mathcal{R}| = \frac{1}{2} \sum_i P(Q_i \Delta R_i), \tag{4.1}$$

which in terms of the measurements (assuming they have distinct values on
distinct atoms) is just $\Pr(f \ne g)$. If we consider $f$ and $g$ as two codes on a common
space, random variable, or random process (that is, finite alphabet mappings),
then the partition distance can also be considered as a form of distance between
the codes. The following lemma shows that entropy of partitions or
measurements is continuous with respect to this distance. The result is originally due
to Fano and is called Fano's inequality [37].

Lemma 4.2.1: Given two finite alphabet measurements $f$ and $g$ on a common
probability space $(\Omega, \mathcal{B}, P)$ having a common alphabet A or, equivalently,
the given corresponding partitions $\mathcal{Q} = \{f^{-1}(a); a \in A\}$ and
$\mathcal{R} = \{g^{-1}(a); a \in A\}$, define the error probability
$P_e = |\mathcal{Q} - \mathcal{R}| = \Pr(f \ne g)$. Then

$$H(f|g) \le h_2(P_e) + P_e \ln(\|A\| - 1)$$

and

$$|H(f) - H(g)| \le h_2(P_e) + P_e \ln(\|A\| - 1)$$

and hence entropy is continuous with respect to partition distance for a fixed
measure.

Proof: Let $M = \|A\|$ and define a measurement

$$r : A \times A \to \{0, 1, \ldots, M-1\}$$

by $r(a,b) = 0$ if $a = b$ and $r(a,b) = i$ if $a \ne b$ and a is the ith letter in the
alphabet $A_b = A - b$. If we know $g$ and we know $r(f,g)$, then clearly we know
$f$ since either $f = g$ (if $r(f,g)$ is 0) or, if not, it is equal to the $r(f,g)$th letter
in the alphabet A with $g$ removed. Since $f$ can be considered a function of $g$
and $r(f,g)$,

$$H(f \mid g, r(f,g)) = 0$$

and hence

$$H(f, g, r(f,g)) = H(f \mid g, r(f,g)) + H(g, r(f,g)) = H(g, r(f,g)).$$

Similarly

$$H(f, g, r(f,g)) = H(f, g).$$

From Lemma 2.3.2

$$H(f,g) = H(g, r(f,g)) \le H(g) + H(r(f,g))$$

or

$$H(f,g) - H(g) = H(f|g) \le H(r(f,g)) = -P(r=0) \ln P(r=0) - \sum_{i=1}^{M-1} P(r=i) \ln P(r=i).$$

Since $P(r=0) = 1 - P_e$ and since $\sum_{i \ne 0} P(r=i) = P_e$, this becomes

$$H(f|g) \le -(1-P_e)\ln(1-P_e) - P_e \sum_{i \ne 0} \frac{P(r=i)}{P_e} \ln \frac{P(r=i)}{P_e} - P_e \ln P_e \le h_2(P_e) + P_e \ln(M-1)$$

since the entropy of a random variable with an alphabet of size $M-1$ is no
greater than $\ln(M-1)$. This proves the first inequality. Since
$H(f) \le H(f,g) = H(f|g) + H(g)$, this implies

$$H(f) - H(g) \le h_2(P_e) + P_e \ln(M-1).$$

Interchanging the roles of $f$ and $g$ completes the proof. □
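
The first inequality of the lemma is easy to check numerically. The following sketch is an illustration under stated assumptions: the joint distribution of $(f, g)$ is supplied as a matrix `joint[a][b]` over a common alphabet of size $M \ge 2$, entropies are in nats, and the function names are hypothetical. It also exhibits a case of equality, in which errors are spread uniformly over the $M - 1$ wrong letters.

```python
import math
import random

def h2(p):
    """Binary entropy function in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def cond_entropy(joint):
    """H(f|g) in nats, where joint[a][b] = Pr(f = a, g = b)."""
    M = len(joint)
    total = 0.0
    for b in range(M):
        pb = sum(joint[a][b] for a in range(M))
        for a in range(M):
            if joint[a][b] > 0:
                total -= joint[a][b] * math.log(joint[a][b] / pb)
    return total

def fano_bound(joint):
    """h2(Pe) + Pe ln(M - 1) with Pe = Pr(f != g); assumes M >= 2."""
    M = len(joint)
    pe = min(1.0, max(0.0, 1.0 - sum(joint[a][a] for a in range(M))))
    return h2(pe) + pe * math.log(M - 1)

# A randomly generated joint distribution on a 4-letter alphabet.
rng = random.Random(0)
M = 4
weights = [[rng.random() for _ in range(M)] for _ in range(M)]
norm = sum(map(sum, weights))
joint = [[w / norm for w in row] for row in weights]

# Equality case: g uniform, f = g with probability 1 - pe, and otherwise
# f uniform over the remaining M - 1 letters.
pe = 0.2
tight = [[(1 - pe) / M if a == b else pe / (M * (M - 1)) for b in range(M)]
         for a in range(M)]
print(cond_entropy(joint), fano_bound(joint))
print(cond_entropy(tight), fano_bound(tight))
```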

The lemma can be used to show that related information measures such 
as mutual information and conditional mutual information are also continuous 
with respect to the partition metric. The following corollary provides useful 
extensions. Similar extensions may be found in Csiszar and Korner [26]. 

Corollary 4.2.1: Given two sequences of measurements $\{f_n\}$ and $\{g_n\}$ with
finite alphabet A on a common probability space, define

$$P_e^{(n)} = \frac{1}{n} \sum_{i=0}^{n-1} \Pr(f_i \ne g_i).$$

Then

$$\frac{1}{n} H(f^n | g^n) \le P_e^{(n)} \ln(\|A\| - 1) + h_2(P_e^{(n)})$$

and

$$\left| \frac{1}{n} H(f^n) - \frac{1}{n} H(g^n) \right| \le P_e^{(n)} \ln(\|A\| - 1) + h_2(P_e^{(n)}).$$

If $\{f_n, g_n\}$ are also AMS and hence the limit

$$\bar{P}_e = \lim_{n\to\infty} P_e^{(n)}$$

exists, then if we define

$$\bar{H}(f|g) = \lim_{n\to\infty} \frac{1}{n} H(f^n | g^n) = \lim_{n\to\infty} \frac{1}{n} \left( H(f^n, g^n) - H(g^n) \right),$$

where the limits exist since the processes are AMS, then

$$\bar{H}(f|g) \le \bar{P}_e \ln(\|A\| - 1) + h_2(\bar{P}_e)$$

$$|\bar{H}(f) - \bar{H}(g)| \le \bar{P}_e \ln(\|A\| - 1) + h_2(\bar{P}_e).$$

Proof: From the chain rule for entropy (Corollary 2.5.1), Lemma 2.5.2, and
Lemma 4.2.1

$$H(f^n | g^n) = \sum_{i=0}^{n-1} H(f_i \mid f^i, g^n) \le \sum_{i=0}^{n-1} H(f_i \mid g^n) \le \sum_{i=0}^{n-1} H(f_i \mid g_i)$$

$$\le \sum_{i=0}^{n-1} \left( \Pr(f_i \ne g_i) \ln(\|A\| - 1) + h_2(\Pr(f_i \ne g_i)) \right)$$

from the previous lemma. Dividing by n and using the concavity of $h_2$, which
gives $n^{-1} \sum_i h_2(\Pr(f_i \ne g_i)) \le h_2(P_e^{(n)})$, yields the first inequality, which
implies the second as in the proof of the previous lemma. If the processes are
jointly AMS, then the limits exist and the entropy rate result follows from the
continuity of $h_2$ by taking the limit. □

The per-symbol probability of error $P_e^{(n)}$ has an alternative form. Recall
that the (average) Hamming distance between two vectors is the number of
positions in which they differ, i.e.,

$$d_H^{(1)}(x_0, y_0) = 1 - \delta_{x_0, y_0},$$

where $\delta_{a,b}$ is the Kronecker delta function (1 if $a = b$ and 0 otherwise), and

$$d_H^{(n)}(x^n, y^n) = \sum_{i=0}^{n-1} d_H^{(1)}(x_i, y_i).$$

We have then that

$$P_e^{(n)} = E\left( \frac{1}{n} d_H^{(n)}(f^n, g^n) \right),$$

the normalized average Hamming distance.
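
A minimal sketch of these two quantities follows; the function names are assumptions for illustration. Note that $P_e^{(n)}$ is an expectation, so the normalized distance computed from a single pair of observed sequences is only a sample estimate of it.

```python
def hamming(x, y):
    """Number of positions in which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)

def empirical_error_rate(f_seq, g_seq):
    """Normalized Hamming distance (1/n) d_H(f^n, g^n) for one realization."""
    return hamming(f_seq, g_seq) / len(f_seq)

print(hamming("karolin", "kathrin"))
print(empirical_error_rate([0, 1, 1, 0], [0, 1, 0, 0]))
```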

The next lemma and corollary provide a useful tool for approximating com- 
plicated codes by simpler ones. 

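Lemma 4.2.2: Given a probability space $(\Omega, \mathcal{B}, P)$ suppose that $\mathcal{F}$ is a
generating field: $\mathcal{B} = \sigma(\mathcal{F})$. Suppose that $\mathcal{B}$-measurable $\mathcal{Q}$ is a partition of $\Omega$
and $\epsilon > 0$. Then there is a partition $\mathcal{Q}'$ with atoms in $\mathcal{F}$ such that $|\mathcal{Q} - \mathcal{Q}'| \le \epsilon$.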

Proof: Let $\|A\| = K$. From Theorem 1.2.1 given $\gamma > 0$ we can find sets
$R_i \in \mathcal{F}$ such that $P(Q_i \Delta R_i) \le \gamma$ for $i = 1, 2, \ldots, K-1$. The remainder of the
proof consists of set theoretic manipulations showing that we can construct the
desired partition from the $R_i$ by removing overlapping pieces. The algebra is
given for completeness, but it can be skipped. Form a partition from the sets
as

$$Q'_i = R_i - \bigcup_{j=1}^{i-1} R_j, \quad i = 1, 2, \ldots, K-1$$

$$Q'_K = \left( \bigcup_{i=1}^{K-1} Q'_i \right)^c.$$

For $i < K$

$$P(Q_i \Delta Q'_i) = P(Q_i \cup Q'_i) - P(Q_i \cap Q'_i) \le P(Q_i \cup R_i) - P\Big(Q_i \cap \big(R_i - \bigcup_{j<i} R_j\big)\Big). \tag{4.2}$$

The rightmost term can be written as

$$P\Big(Q_i \cap \big(R_i - \bigcup_{j<i} R_j\big)\Big) = P\Big((Q_i \cap R_i) - \bigcup_{j<i} (Q_i \cap R_i \cap R_j)\Big) = P(Q_i \cap R_i) - P\Big(\bigcup_{j<i} (Q_i \cap R_i \cap R_j)\Big), \tag{4.3}$$

where we have used the fact that a set difference is unchanged if the portion
being removed is intersected with the set it is being removed from and we have
used the fact that $P(F - G) = P(F) - P(G)$ if $G \subset F$. Combining (4.2) and
(4.3) we have that

$$P(Q_i \Delta Q'_i) \le P(Q_i \cup R_i) - P(Q_i \cap R_i) + P\Big(\bigcup_{j<i} (Q_i \cap R_i \cap R_j)\Big)$$

$$= P(Q_i \Delta R_i) + P\Big(\bigcup_{j<i} (Q_i \cap R_i \cap R_j)\Big) \le \gamma + \sum_{j<i} P(Q_i \cap R_j).$$

For $j \ne i$, however, we have that

$$P(Q_i \cap R_j) = P(Q_i \cap R_j \cap Q_j^c) \le P(R_j \cap Q_j^c) \le P(R_j \Delta Q_j) \le \gamma,$$

which with the previous equation implies that

$$P(Q_i \Delta Q'_i) \le K\gamma; \quad i = 1, 2, \ldots, K-1.$$

For the remaining atom:

$$P(Q_K \Delta Q'_K) = P\big((Q_K \cap Q_K'^c) \cup (Q_K^c \cap Q'_K)\big). \tag{4.4}$$

We have

$$Q_K \cap Q_K'^c = Q_K \cap \Big(\bigcup_{j<K} Q'_j\Big) = Q_K \cap \Big(\bigcup_{j<K} (Q'_j \cap Q_j^c)\Big),$$

where the last equality follows since points in $Q'_j$ that are also in $Q_j$ cannot
contribute to the intersection with $Q_K$ since the $Q_j$ are disjoint. Since
$Q'_j \cap Q_j^c \subset Q'_j \Delta Q_j$ we have

$$Q_K \cap Q_K'^c \subset Q_K \cap \Big(\bigcup_{j<K} Q'_j \Delta Q_j\Big) \subset \bigcup_{j<K} Q'_j \Delta Q_j.$$

A similar argument shows that

$$Q_K^c \cap Q'_K \subset \bigcup_{j<K} Q'_j \Delta Q_j$$

and hence with (4.4)

$$P(Q_K \Delta Q'_K) \le P\Big(\bigcup_{j<K} Q_j \Delta Q'_j\Big) \le \sum_{j<K} P(Q_j \Delta Q'_j) \le K^2 \gamma.$$

To summarize, we have shown that

$$P(Q_i \Delta Q'_i) \le K^2 \gamma; \quad i = 1, 2, \ldots, K.$$

If we now choose $\gamma$ so small that $K^2 \gamma \le \epsilon / K$, the lemma is proved. □
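
The construction of $\mathcal{Q}'$ from the approximating sets can be sketched on a finite sample space. Everything in this example is an assumption made for illustration: the sample space is the set {0, ..., 9} with the uniform measure, the approximating sets are chosen by hand, and the function names are hypothetical.

```python
def disjointify(R, universe):
    """Build Q'_i = R_i - union_{j<i} R_j for i < K, followed by the final
    atom Q'_K, the complement of the union of the earlier atoms."""
    atoms, covered = [], set()
    for r in R:
        atoms.append(set(r) - covered)
        covered |= set(r)
    atoms.append(set(universe) - covered)
    return atoms

def partition_distance(Q, Qp, universe):
    """|Q - Q'| = (1/2) sum_i P(Q_i symmetric-difference Q'_i), uniform P."""
    n = len(universe)
    return 0.5 * sum(len(set(a) ^ set(b)) for a, b in zip(Q, Qp)) / n

universe = range(10)
Q = [{0, 1, 2}, {3, 4, 5}, {6}, {7, 8, 9}]   # target partition, K = 4
R = [{0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6}]     # overlapping approximations
Qp = disjointify(R, universe)
print(Qp, partition_distance(Q, Qp, universe))
```

Here only the point 3 ends up in the wrong atom, so the partition distance equals its probability, in agreement with the interpretation of the distance as $\Pr(f \ne g)$.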

Corollary 4.2.2: Let $(\Omega, \mathcal{B}, P)$ be a probability space and $\mathcal{F}$ a generating
field. Let $f : \Omega \to A$ be a finite alphabet measurement. Given $\epsilon > 0$ there
is a measurement $g : \Omega \to A$ that is measurable with respect to $\mathcal{F}$ (that is,
$g^{-1}(a) \in \mathcal{F}$ for all $a \in A$) for which $P(f \ne g) \le \epsilon$.

Proof: Follows from the previous lemma by setting $\mathcal{Q} = \{f^{-1}(a); a \in A\}$,
choosing $\mathcal{Q}'$ from the lemma, and then assigning $g$ for atom $Q'_i$ in $\mathcal{Q}'$ the same
value that $f$ takes on in atom $Q_i$ in $\mathcal{Q}$. Then

$$P(f \ne g) = \frac{1}{2} \sum_i P(Q_i \Delta Q'_i) = |\mathcal{Q} - \mathcal{Q}'| \le \epsilon. \quad \Box$$

We now develop applications of the previous results which relate the idea of 
the entropy of a dynamical system with the entropy rate of a random process. 
The result is not required for later coding theorems, but it provides insight into 
the connections between entropy as considered in ergodic theory and entropy as 
used in information theory. In addition, the development involves some ideas of 
coding and approximation which are useful in proving the ergodic theorems of 
information theory used to prove coding theorems. 

Let $\{X_n\}$ be a random process with alphabet $A_X$. Let $A_X^\infty$ denote the one or
two-sided sequence space. Consider the dynamical system $(\Omega, \mathcal{B}, P, T)$ defined
by $(A_X^\infty, \mathcal{B}(A_X)^\infty, P, T)$, where P is the process distribution and T the shift.
Recall from Section 2.2 that a stationary coding or infinite length sliding block
coding of $\{X_n\}$ is a measurable mapping $f : A_X^\infty \to A_f$ into a finite alphabet
which produces an encoded process $\{f_n\}$ defined by

$$f_n(x) = f(T^n x); \quad x \in A_X^\infty.$$

The entropy $H(P, T)$ of the dynamical system was defined by

$$H(P, T) = \sup_f \bar{H}_P(f),$$

the supremum of the entropy rates of finite alphabet stationary codings of the 
original process. We shall soon show that if the original alphabet is finite, then 
the entropy of the dynamical system is exactly the entropy rate of the process. 
First, however, we require several preliminary results, some of independent in- 
terest. 

Lemma 4.2.3: If $f$ is a stationary coding of an AMS process, then the
process $\{f_n\}$ is also AMS. If the input process is ergodic, then so is $\{f_n\}$.

Proof: Suppose that the input process has alphabet $A_X$ and distribution P
and that the measurement $f$ has alphabet $A_f$. Define the sequence mapping
$\bar{f} : A_X^\infty \to A_f^\infty$ by $\bar{f}(x) = \{f_n(x); n \in \mathcal{T}\}$, where $f_n(x) = f(T^n x)$ and T is
the shift on the input sequence space $A_X^\infty$. If T also denotes the shift on the
output space, then by construction $\bar{f}(Tx) = T\bar{f}(x)$ and hence for any output
event F, $\bar{f}^{-1}(T^{-1}F) = T^{-1}\bar{f}^{-1}(F)$. Let m denote the process distribution for
the encoded process. Since $m(F) = P(\bar{f}^{-1}(F))$ for any event $F \in \mathcal{B}(A_f)^\infty$, we
have using the stationarity of the mapping $\bar{f}$ that

$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} m(T^{-i}F) = \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} P(\bar{f}^{-1}(T^{-i}F)) = \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} P(T^{-i}\bar{f}^{-1}(F)) = \bar{P}(\bar{f}^{-1}(F)),$$

where $\bar{P}$ is the stationary mean of P. Thus m is AMS. If G is an invariant
output event, then $\bar{f}^{-1}(G)$ is also invariant since $T^{-1}\bar{f}^{-1}(G) = \bar{f}^{-1}(T^{-1}G)$.
Hence if input invariant sets can only have probability 1 or 0, the same is true
for output invariant sets. □

The lemma and Theorem 3.1.1 immediately yield the following:

Corollary 4.2.3: If $f$ is a stationary coding of an AMS process, then

$$\bar{H}(f) = \lim_{n\to\infty} \frac{1}{n} H(f^n),$$

that is, the limit exists.

For later use the next result considers general standard alphabets. A
stationary code $f$ is a scalar quantizer if there is a map $q : A_X \to A_f$ such that
$f(x) = q(x_0)$. Intuitively, $f$ depends on the input sequence only through the
current symbol. Mathematically, $f$ is measurable with respect to $\sigma(X_0)$. Such
codes are effectively the simplest possible and have no memory or dependence
on the future.

Lemma 4.2.4: Let $\{X_n\}$ be an AMS process with standard alphabet $A_X$
and distribution m. Let $f$ be a stationary coding of the process with finite
alphabet $A_f$. Fix $\epsilon > 0$. If the process is two-sided, then there is a scalar
quantizer $q : A_X \to A_q$, an integer N, and a mapping $g : A_q^{2N+1} \to A_f$ such that

$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \Pr\big(f_i \ne g(q(X_{i-N}), q(X_{i-N+1}), \ldots, q(X_{i+N}))\big) \le \epsilon.$$

If the process is one-sided, then there is a scalar quantizer $q : A_X \to A_q$, an
integer N, and a mapping $g : A_q^N \to A_f$ such that

$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \Pr\big(f_i \ne g(q(X_i), q(X_{i+1}), \ldots, q(X_{i+N-1}))\big) \le \epsilon.$$

Comment: The lemma states that any stationary coding of an AMS process can 
be approximated by a code that depends only on a finite number of quantized
inputs, that is, by a coding of a finite window of a scalar quantized version of
the original process. In the special case of a finite alphabet input process, the 
lemma states that an arbitrary stationary coding can be well approximated by 
a coding depending only on a finite number of the input symbols. 
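
The approximating codes described in the comment have a simple computational form. The following sketch of the one-sided case is an illustration only: the uniform quantizer on [0, 1), the parity map used for $g$, and all names are assumptions rather than constructions from the text.

```python
def scalar_quantizer(x, levels):
    """Hypothetical uniform quantizer on [0, 1): index of the cell holding x."""
    return min(int(x * levels), levels - 1)

def sliding_block_encode(xs, q, g, N):
    """One-sided finite-window code: output i is g(q(x_i), ..., q(x_{i+N-1})).
    Only positions with a complete window are encoded."""
    qs = [q(x) for x in xs]
    return [g(tuple(qs[i:i + N])) for i in range(len(qs) - N + 1)]

def q(x):
    """Two-level quantizer."""
    return scalar_quantizer(x, 2)

def g(window):
    """Parity of the quantized window."""
    return sum(window) % 2

xs = [0.1, 0.6, 0.7, 0.2, 0.9]
out = sliding_block_encode(xs, q, g, N=2)
shifted = sliding_block_encode(xs[1:], q, g, N=2)
print(out, shifted)
```

Encoding the shifted input produces the shifted output, which is the finite-window counterpart of the relation $f_n(x) = f(T^n x)$ that makes such a code stationary.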

Proof: Suppose that $\bar{m}$ is the stationary mean and hence for any
measurements $f$ and $g$

$$\bar{m}(f_0 \ne g_0) = \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \Pr(f_i \ne g_i).$$

Let $q_n$ be an asymptotically accurate scalar quantizer in the sense that $\sigma(q_n(X_0))$
asymptotically generates $\mathcal{B}(A_X)$. (Since $A_X$ is standard this exists. If $A_X$ is
finite, then take $q(a) = a$.) Then $\mathcal{F}_n = \sigma(q_n(X_i); i = 0, 1, 2, \ldots, n-1)$
asymptotically generates $\mathcal{B}(A_X)^\infty$ for one-sided processes and
$\mathcal{F}_n = \sigma(q_n(X_i); i = -n, \ldots, n)$ does the same for two-sided processes. Hence from
Corollary 4.2.2 given $\epsilon$ we can find a sufficiently large n and a mapping $g$ that is
measurable with respect to $\mathcal{F}_n$ such that $\bar{m}(f \ne g) \le \epsilon$. Since $g$ is measurable with
respect to $\mathcal{F}_n$, it must depend on only the finite number of quantized samples that
generate $\mathcal{F}_n$. (See, e.g., Lemma 5.2.1 of [50].) This proves the lemma. □

Combining the lemma and Corollary 4.2.1 immediately yields the following 
corollary, which permits us to study the entropy rate of general stationary codes 
by considering codes which depend on only a finite number of inputs (and hence 
for which the ordinary entropy results for random vectors can be applied). 

Corollary 4.2.4: Given a stationary coding $f$ of an AMS process let $\mathcal{F}_n$ be
defined as above. Then given $\epsilon > 0$ there exists for sufficiently large n a code $g$
measurable with respect to $\mathcal{F}_n$ such that

$$|\bar{H}(f) - \bar{H}(g)| \le \epsilon.$$

The above corollary can be used to show that entropy rate, like entropy, 
is reduced by coding. The general stationary code is approximated by a code 
depending on only a finite number of inputs and then the result that entropy is 
reduced by mapping (Lemma 2.3.3) is applied. 

Corollary 4.2.5: Given an AMS process $\{X_n\}$ with finite alphabet $A_X$ and
a stationary coding $f$ of the process, then

$$\bar{H}(X) \ge \bar{H}(f),$$

that is, stationary coding reduces entropy rate.

Proof: For integer n define $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$ in the one-sided case
and $\sigma(X_{-n}, \ldots, X_n)$ in the two-sided case. Then $\mathcal{F}_n$ asymptotically generates
$\mathcal{B}(A_X)^\infty$. Hence given a code $f$ and an $\epsilon > 0$ we can choose using the finite
alphabet special case of the previous lemma a large k and a $\mathcal{F}_k$-measurable code
$g$ such that $|\bar{H}(f) - \bar{H}(g)| \le \epsilon$. We shall show that $\bar{H}(g) \le \bar{H}(X)$, which will
prove the corollary since $\epsilon$ is arbitrary. To see this in the one-sided case observe
that $g$ is a function of $X^k$ and hence $g^n$ depends only on $X^{n+k}$ and hence

$$H(g^n) \le H(X^{n+k})$$

and hence

$$\bar{H}(g) = \lim_{n\to\infty} \frac{1}{n} H(g^n) \le \lim_{n\to\infty} \frac{n+k}{n} \cdot \frac{1}{n+k} H(X^{n+k}) = \bar{H}(X).$$

In the two-sided case $g$ depends on $\{X_{-k}, \ldots, X_k\}$ and hence $g^n$ depends on
$\{X_{-k}, \ldots, X_{n+k}\}$ and hence

$$H(g^n) \le H(X_{-k}, \ldots, X_{-1}, X_0, \ldots, X_{n+k}) \le H(X_{-k}, \ldots, X_{-1}) + H(X^{n+k}).$$

Dividing by n and taking the limit completes the proof as before. □

Theorem 4.2.1: Let $\{X_n\}$ be a random process with alphabet $A_X$. Let
$A_X^\infty$ denote the one or two-sided sequence space. Consider the dynamical system
$(\Omega, \mathcal{B}, P, T)$ defined by $(A_X^\infty, \mathcal{B}(A_X)^\infty, P, T)$, where P is the process distribution
and T is the shift. Then

$$H(P, T) = \bar{H}(X).$$

Proof: From (2.2.4), $H(P, T) \ge \bar{H}(X)$. Conversely suppose that $f$ is a code
which yields $\bar{H}(f) \ge H(P, T) - \epsilon$. Since $f$ is a stationary coding of the process
$\{X_n\}$, the previous corollary implies that $\bar{H}(f) \le \bar{H}(X)$ and hence that
$H(P, T) - \epsilon \le \bar{H}(X)$. Since $\epsilon$ is arbitrary, this completes the proof. □



4.3 Information Rate of Finite Alphabet Processes

Let $\{(X_n, Y_n)\}$ be a one-sided random process with finite alphabet $A \times B$ and
let $((A \times B)^{Z_+}, \mathcal{B}(A \times B)^{Z_+})$ be the corresponding one-sided sequence space of
outputs of the pair process. We consider $X_n$ and $Y_n$ to be the sampling functions
on the sequence spaces $A^\infty$ and $B^\infty$ and $(X_n, Y_n)$ to be the pair sampling
function on the product space, that is, for $(x, y) \in A^\infty \times B^\infty$,
$(X_n, Y_n)(x, y) = (X_n(x), Y_n(y)) = (x_n, y_n)$. Let p denote the process distribution
induced by the original space on the process $\{(X_n, Y_n)\}$. Analogous to entropy rate
we can define the mutual information rate (or simply information rate) of a finite
alphabet pair process by

$$\bar{I}(X; Y) = \limsup_{n\to\infty} \frac{1}{n} I(X^n; Y^n).$$

The following lemma follows immediately from the properties of entropy rates
of Theorems 2.4.1 and 3.1.1 since for AMS finite alphabet processes

$$\bar{I}(X; Y) = \bar{H}(X) + \bar{H}(Y) - \bar{H}(X, Y)$$

and since from (3.1.4) the entropy rate of an AMS process is the same as that of
its stationary mean. Analogous to Theorem 3.1.1 we define the random variables
$p(X^n, Y^n)$ by $p(X^n, Y^n)(x, y) = p(X^n = x^n, Y^n = y^n)$, $p(X^n)$ by
$p(X^n)(x, y) = p(X^n = x^n)$, and similarly for $p(Y^n)$.

Lemma 4.3.1: Suppose that $\{X_n, Y_n\}$ is an AMS finite alphabet random
process with distribution p and stationary mean $\bar{p}$. Then the limits supremum
defining information rates are limits and

$$\bar{I}_p(X; Y) = \bar{I}_{\bar{p}}(X; Y).$$

$\bar{I}_p$ is an affine function of the distribution p. If $\bar{p}$ has ergodic decomposition
$\bar{p}_{xy}$, then

$$\bar{I}_p(X; Y) = \int d\bar{p}(x, y)\, \bar{I}_{\bar{p}_{xy}}(X; Y).$$

If we define the information density

$$i_n(X^n; Y^n) = \ln \frac{p(X^n, Y^n)}{p(X^n)\, p(Y^n)},$$

then

$$\lim_{n\to\infty} \frac{1}{n} i_n(X^n; Y^n) = \bar{I}_{\bar{p}_{xy}}(X; Y)$$

almost everywhere with respect to $\bar{p}$ and p and in $L^1(p)$.
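
In the simplest special case of an i.i.d. pair process, $n^{-1} I(X^n; Y^n)$ equals the single-letter mutual information for every n, so the information rate can be computed from one joint distribution. The sketch below checks the entropy identity in that case; the joint distribution and the function names are assumptions chosen for illustration, and entropies are in nats.

```python
import math

def entropy(probs):
    """Entropy in nats of a probability vector, with 0 ln 0 taken as 0."""
    return -sum(v * math.log(v) for v in probs if v > 0)

def mutual_information(joint):
    """I(X;Y) as the expected information density ln p(x,y)/(p(x)p(y))."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(v * math.log(v / (px[a] * py[b]))
               for a, row in enumerate(joint)
               for b, v in enumerate(row) if v > 0)

joint = [[0.4, 0.1], [0.1, 0.4]]      # hypothetical single-letter distribution
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
flat = [v for row in joint for v in row]
via_entropies = entropy(px) + entropy(py) - entropy(flat)
print(mutual_information(joint), via_entropies)
```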

The following lemmas follow either directly from or similarly to the corre- 
sponding results for entropy rate of the previous section. 

Lemma 4.3.2: Suppose that $\{X_n, Y_n, X'_n, Y'_n\}$ is an AMS process and

$$P = \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \Pr\big((X_i, Y_i) \ne (X'_i, Y'_i)\big) \le \epsilon$$

(the limit exists since the process is AMS). Then

$$|\bar{I}(X; Y) - \bar{I}(X'; Y')| \le 3\big(\epsilon \ln(\|A\| - 1) + h_2(\epsilon)\big).$$

Proof: The inequality follows from Corollary 4.2.1 since

$$|\bar{I}(X; Y) - \bar{I}(X'; Y')| \le |\bar{H}(X) - \bar{H}(X')| + |\bar{H}(Y) - \bar{H}(Y')| + |\bar{H}(X, Y) - \bar{H}(X', Y')|$$

and since $\Pr((X_i, Y_i) \ne (X'_i, Y'_i)) = \Pr(X_i \ne X'_i \text{ or } Y_i \ne Y'_i)$ is no smaller
than $\Pr(X_i \ne X'_i)$ or $\Pr(Y_i \ne Y'_i)$. □

Corollary 4.3.1: Let $\{X_n, Y_n\}$ be an AMS process and let $f$ and $g$ be
stationary measurements on X and Y, respectively. Given $\epsilon > 0$ there is an
N sufficiently large, scalar quantizers q and r, and mappings $f'$ and $g'$ which
depend only on $\{q(X_0), \ldots, q(X_{N-1})\}$ and $\{r(Y_0), \ldots, r(Y_{N-1})\}$ in the one-sided
case and $\{q(X_{-N}), \ldots, q(X_N)\}$ and $\{r(Y_{-N}), \ldots, r(Y_N)\}$ in the two-sided case
such that

$$|\bar{I}(f; g) - \bar{I}(f'; g')| \le \epsilon.$$

Proof: Choose the codes $f'$ and $g'$ from Lemma 4.2.4 and apply the previous
lemma. □

Lemma 4.3.3: If $\{X_n, Y_n\}$ is an AMS process and $f$ and $g$ are stationary
codings of X and Y, respectively, then

$$\bar{I}(X; Y) \ge \bar{I}(f; g).$$

Proof: This is proved as Corollary 4.2.5 by first approximating $f$ and $g$ by
finite-window stationary codes, applying the result for mutual information (Lemma
2.5.2), and then taking the limit. □



Chapter 5



Relative Entropy 



5.1 Introduction 

A variety of information measures have been introduced for finite alphabet ran- 
dom variables, vectors, and processes: entropy, mutual information, relative en- 
tropy, conditional entropy, and conditional mutual information. All of these 
can be expressed in terms of divergence and hence the generalization of these 
definitions to infinite alphabets will follow from a general definition of diver- 
gence. Many of the properties of generalized information measures will then 
follow from those of generalized divergence. 

In this chapter we extend the definition and develop the basic properties 
of divergence, including the formulas for evaluating divergence as expectations 
of information densities and as limits of divergences of finite codings. We also 
develop several inequalities for and asymptotic properties of divergence. These 
results provide the groundwork needed for generalizing the ergodic theorems of 
information theory from finite to standard alphabets. The general definitions 
of entropy and information measures originated in the pioneering work of Kol- 
mogorov and his colleagues Gelfand, Yaglom, Dobrushin, and Pinsker [45] [90] 
[32] [125].

5.2 Divergence 

Given a probability space $(\Omega, \mathcal{B}, P)$ (not necessarily with finite alphabet) and
another probability measure M on the same space, define the divergence of P
with respect to M by

$$D(P\|M) = \sup_{\mathcal{Q}} H_{P\|M}(\mathcal{Q}) = \sup_f D(P_f \| M_f), \tag{5.1}$$

where the first supremum is over all finite measurable partitions $\mathcal{Q}$ of $\Omega$ and the
second is over all finite alphabet measurements on $\Omega$. The two forms have the
same interpretation: the divergence is the supremum of the relative entropies
or divergences obtainable by finite alphabet codings of the sample space. The
partition form is perhaps more common when considering divergence per se,
but the measurement or code form is usually more intuitive when considering
entropy and information. This section is devoted to developing the basic
properties of divergence, all of which will yield immediate corollaries for the measures
of information.

The first result is a generalization of the divergence inequality that is a trivial 
consequence of the definition and the finite alphabet special case. 

Lemma 5.2.1: The Divergence Inequality: 

For any two probability measures $P$ and $M$,

$$D(P\|M) \ge 0,$$

with equality if and only if $P = M$.

Proof: Given any partition $\mathcal{Q}$, Theorem 2.3.1 implies that

$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \ge 0,$$

with equality if and only if $P(Q) = M(Q)$ for all atoms $Q$ of the partition. Since
$D(P\|M)$ is the supremum over all such partitions, it is also nonnegative. It can
be 0 only if $P$ and $M$ assign the same probabilities to all atoms in all partitions
(the supremum is 0 only if the above sum is 0 for all partitions) and hence the
divergence is 0 only if the measures are identical. □

As in the finite alphabet case, Lemma 5.2.1 justifies interpreting divergence 
as a form of distance or dissimilarity between two probability measures. It is 
not a true distance or metric in the mathematical sense since it is not symmetric 
and it does not satisfy the triangle inequality. Since it is nonnegative and equals 
zero only if two measures are identical, the divergence is a distortion measure 
as considered in information theory [51], which is a generalization of the notion 
of distance. This view often provides interpretations of the basic properties of 
divergence. We shall develop several relations between the divergence and other 
distance measures. The reader is referred to Csiszar [25] for a development of 
the distance-like properties of divergence. 
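
For a finite sample space the supremum in (5.1) is attained by the finest partition, so the divergence reduces to the familiar sum over points. The following minimal sketch (the function name `divergence` and the example pmfs are illustrative choices, not part of the text) evaluates that sum and illustrates the properties just discussed: nonnegativity, equality to zero for identical measures, lack of symmetry, and an infinite value when absolute continuity fails.

```python
import math

def divergence(p, m):
    """Relative entropy D(P||M) in nats between two pmfs on the same finite set."""
    total = 0.0
    for p_i, m_i in zip(p, m):
        if p_i == 0.0:
            continue              # convention: 0 ln 0 = 0
        if m_i == 0.0:
            return math.inf       # P is not absolutely continuous w.r.t. M
        total += p_i * math.log(p_i / m_i)
    return total

p = [0.7, 0.2, 0.1]
u = [1 / 3, 1 / 3, 1 / 3]

d_pu = divergence(p, u)   # D(P||U)
d_up = divergence(u, p)   # D(U||P), generally a different number
d_pp = divergence(p, p)   # zero, since the measures coincide
```

Comparing `d_pu` with `d_up` shows concretely why the divergence is treated as a distortion measure and not as a metric.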

The following two lemmas provide means for computing divergences and 
studying their behavior. The first result shows that the supremum can be con- 
fined to partitions with atoms in a generating field. This will provide a means 
for computing divergences by approximation or limits. The result is due to 
Dobrushin and is referred to as Dobrushin’s theorem. The second result shows 
that the divergence can be evaluated as the expectation of an entropy density 
defined as the logarithm of the Radon-Nikodym derivative of one measure rela- 
tive to the other. This result is due to Gelfand, Yaglom, and Perez. The proofs 
largely follow the translator’s remarks in Chapter 2 of Pinsker [125] (which in 
turn follows Dobrushin [32]). 

Lemma 5.2.2: Suppose that $(\Omega, \mathcal{B})$ is a measurable space where $\mathcal{B}$ is generated
by a field $\mathcal{F}$, $\mathcal{B} = \sigma(\mathcal{F})$. Then if $P$ and $M$ are two probability measures
on this space,

$$D(P\|M) = \sup_{\mathcal{Q} \subset \mathcal{F}} H_{P\|M}(\mathcal{Q}).$$

Proof: From the definition of divergence, the right-hand term above is clearly
less than or equal to the divergence. If $P$ is not absolutely continuous with
respect to $M$, then we can find a set $F$ such that $M(F) = 0$ but $P(F) \ne 0$ and
hence the divergence is infinite. Approximating this event by a field element $F_0$
by applying Theorem 1.2.1 simultaneously to $M$ and $P$ will yield a partition
$\{F_0, F_0^c\}$ for which the right-hand side of the previous equation is arbitrarily
large. Hence the lemma holds for this case. Henceforth assume that $M \gg P$.

Fix $\epsilon > 0$ and suppose that a partition $\mathcal{Q} = \{Q_1, \cdots, Q_K\}$ yields a relative
entropy close to the divergence, that is,

$$H_{P\|M}(\mathcal{Q}) = \sum_{i=1}^{K} P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)} \ge D(P\|M) - \epsilon/2.$$

We will show that there is a partition, say $\mathcal{Q}'$, with atoms in $\mathcal{F}$ which has
almost the same relative entropy, which will prove the lemma. First observe that
$P(Q) \ln[P(Q)/M(Q)]$ is a continuous function of $P(Q)$ and $M(Q)$ in the sense
that given $\epsilon/(2K)$ there is a sufficiently small $\delta > 0$ such that if
$|P(Q) - P(Q')| \le \delta$ and $|M(Q) - M(Q')| \le \delta$, then provided $M(Q) \ne 0$

$$\left| P(Q) \ln \frac{P(Q)}{M(Q)} - P(Q') \ln \frac{P(Q')}{M(Q')} \right| \le \frac{\epsilon}{2K}.$$

If we can find a partition $\mathcal{Q}'$ with atoms in $\mathcal{F}$ such that

$$|P(Q_i') - P(Q_i)| \le \delta, \quad |M(Q_i') - M(Q_i)| \le \delta, \quad i = 1, \cdots, K, \tag{5.2}$$

then

$$|H_{P\|M}(\mathcal{Q}') - H_{P\|M}(\mathcal{Q})| \le \sum_{i=1}^{K} \left| P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)} - P(Q_i') \ln \frac{P(Q_i')}{M(Q_i')} \right| \le K \frac{\epsilon}{2K} = \frac{\epsilon}{2}$$

and hence

$$H_{P\|M}(\mathcal{Q}') \ge D(P\|M) - \epsilon,$$

which will prove the lemma. To find the partition $\mathcal{Q}'$ satisfying (5.2), let $m$ be
the mixture measure $P/2 + M/2$. As in the proof of Lemma 4.2.2, we can find a
partition $\mathcal{Q}' \subset \mathcal{F}$ such that $m(Q_i \Delta Q_i') \le K^2\gamma$ for $i = 1, 2, \cdots, K$, which implies
that

$$P(Q_i \Delta Q_i') \le 2K^2\gamma, \quad i = 1, 2, \cdots, K,$$

and

$$M(Q_i \Delta Q_i') \le 2K^2\gamma, \quad i = 1, 2, \cdots, K.$$

If we now choose $\gamma$ so small that $2K^2\gamma \le \delta$, then (5.2) and hence the lemma
follow from the above and the fact that

$$|P(F) - P(G)| \le P(F \Delta G). \tag{5.3}$$

□

Lemma 5.2.3: Given two probability measures $P$ and $M$ on a common
measurable space, if $P$ is not absolutely continuous with respect to $M$,
then

$$D(P\|M) = \infty.$$

If $P \ll M$ (e.g., if $D(P\|M) < \infty$), then the Radon-Nikodym derivative
$f = dP/dM$ exists and

$$D(P\|M) = \int \ln f(\omega)\,dP(\omega) = \int f(\omega) \ln f(\omega)\,dM(\omega).$$

The quantity $\ln f$ (if it exists) is called the entropy density or relative entropy
density of $P$ with respect to $M$.

Proof: The first statement was shown in the proof of the previous lemma. If
$P$ is not absolutely continuous with respect to $M$, then there is a set $Q$ such that
$M(Q) = 0$ and $P(Q) > 0$. The relative entropy for the partition $\mathcal{Q} = \{Q, Q^c\}$
is then infinite, and hence so is the divergence.

Assume that $P \ll M$ and let $f = dP/dM$. Suppose that $Q$ is an event for
which $M(Q) > 0$ and consider the conditional cumulative distribution function
for the real random variable $f$ given that $\omega \in Q$:

$$F_Q(u) = \frac{M(\{f < u\} \cap Q)}{M(Q)}; \quad u \in (-\infty, \infty).$$

Observe that the expectation with respect to this distribution is

$$E_M(f|Q) = \int_0^\infty u\,dF_Q(u) = \frac{1}{M(Q)} \int_Q f(\omega)\,dM(\omega) = \frac{P(Q)}{M(Q)}.$$

We also have that

$$\int_0^\infty u \ln u\,dF_Q(u) = \frac{1}{M(Q)} \int_Q f(\omega) \ln f(\omega)\,dM(\omega),$$

where the existence of the integral is ensured by the fact that $u \ln u \ge -e^{-1}$.

Applying Jensen's inequality to the convex function $u \ln u$ yields the
inequality

$$\frac{1}{M(Q)} \int_Q \ln f(\omega)\,dP(\omega) = \frac{1}{M(Q)} \int_Q f(\omega) \ln f(\omega)\,dM(\omega) = \int_0^\infty u \ln u\,dF_Q(u)$$

$$\ge \left[\int_0^\infty u\,dF_Q(u)\right] \ln \left[\int_0^\infty u\,dF_Q(u)\right] = \frac{P(Q)}{M(Q)} \ln \frac{P(Q)}{M(Q)}.$$

We therefore have for any event $Q$ with $M(Q) > 0$ that

$$\int_Q \ln f(\omega)\,dP(\omega) \ge P(Q) \ln \frac{P(Q)}{M(Q)}. \tag{5.4}$$

Now let $\mathcal{Q} = \{Q_i\}$ be a finite partition and we have

$$\int \ln f(\omega)\,dP(\omega) = \sum_i \int_{Q_i} \ln f(\omega)\,dP(\omega) = \sum_{i: P(Q_i) \ne 0} \int_{Q_i} \ln f(\omega)\,dP(\omega) \ge \sum_i P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)},$$

where the inequality follows from (5.4) since $P(Q_i) \ne 0$ implies that $M(Q_i) \ne 0$
since $M \gg P$. This proves that

$$D(P\|M) \le \int \ln f(\omega)\,dP(\omega).$$



To obtain the converse inequality, let $q_n$ denote the asymptotically accurate
quantizers of Section 1.6. From (1.6.3)

$$\int \ln f(\omega)\,dP(\omega) = \lim_{n\to\infty} \int q_n(\ln f(\omega))\,dP(\omega).$$

For fixed $n$ the quantizer $q_n$ induces a partition of $\Omega$ into $2n2^n + 1$ atoms
$Q$. In particular, there are $2n2^n - 1$ "good" atoms such that for $\omega, \omega'$ inside
the atoms we have that $|\ln f(\omega) - \ln f(\omega')| \le 2^{-(n-1)}$. The remaining two
atoms group $\omega$ for which $\ln f(\omega) \ge n$ or $\ln f(\omega) < -n$. Defining the shorthand
$P(\ln f < -n) = P(\{\omega : \ln f(\omega) < -n\})$, we have then that

$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} = \sum_{\text{good } Q} P(Q) \ln \frac{P(Q)}{M(Q)} + P(\ln f \ge n) \ln \frac{P(\ln f \ge n)}{M(\ln f \ge n)} + P(\ln f < -n) \ln \frac{P(\ln f < -n)}{M(\ln f < -n)}.$$

The rightmost two terms above are bounded below as

$$P(\ln f \ge n) \ln \frac{P(\ln f \ge n)}{M(\ln f \ge n)} + P(\ln f < -n) \ln \frac{P(\ln f < -n)}{M(\ln f < -n)}$$

$$\ge P(\ln f \ge n) \ln P(\ln f \ge n) + P(\ln f < -n) \ln P(\ln f < -n).$$

Since $P(\ln f \ge n)$ and $P(\ln f < -n) \to 0$ as $n \to \infty$ and since $x \ln x \to 0$ as
$x \to 0$, given $\epsilon$ we can choose $n$ large enough to ensure that the above term is
greater than $-\epsilon$. This yields the lower bound

$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \ge \sum_{\text{good } Q} P(Q) \ln \frac{P(Q)}{M(Q)} - \epsilon.$$

Fix a good atom $Q$ and define $\bar{h} = \sup_{\omega \in Q} \ln f(\omega)$ and $\underline{h} = \inf_{\omega \in Q} \ln f(\omega)$
and note that by definition of the good atoms

$$\bar{h} - \underline{h} \le 2^{-(n-1)}.$$

We now have that

$$P(Q)\,\bar{h} \ge \int_Q \ln f(\omega)\,dP(\omega)$$

and

$$M(Q)\,e^{\underline{h}} \le \int_Q f(\omega)\,dM(\omega) = P(Q).$$

Combining these we have that

$$P(Q) \ln \frac{P(Q)}{M(Q)} \ge P(Q) \ln \frac{P(Q)}{P(Q)\,e^{-\underline{h}}} = P(Q)\,\underline{h}$$

$$\ge P(Q)\left(\bar{h} - 2^{-(n-1)}\right) \ge \int_Q \ln f(\omega)\,dP(\omega) - P(Q)\,2^{-(n-1)}.$$

Therefore

$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \ge \sum_{\text{good } Q} P(Q) \ln \frac{P(Q)}{M(Q)} - \epsilon \ge \sum_{\text{good } Q} \int_Q \ln f(\omega)\,dP(\omega) - 2^{-(n-1)} - \epsilon$$

$$= \int_{\omega:\, -n \le \ln f(\omega) < n} \ln f(\omega)\,dP(\omega) - 2^{-(n-1)} - \epsilon.$$

Since this is true for arbitrarily large $n$ and arbitrarily small $\epsilon$,

$$D(P\|M) \ge \int \ln f(\omega)\,dP(\omega),$$

completing the proof of the lemma. □



It is worthwhile to point out two examples for the previous lemma. If $P$ and
$M$ are discrete measures with corresponding pmf's $p$ and $m$, then the Radon-
Nikodym derivative is simply $dP/dM(\omega) = p(\omega)/m(\omega)$ and the lemma gives the
known formula for the discrete case. If $P$ and $M$ are both probability measures
on Euclidean space $\mathcal{R}^n$ and if both measures are absolutely continuous with
respect to Lebesgue measure, then there exists a density $f$ called a probability
density function or pdf such that

$$P(F) = \int_F f(x)\,dx,$$

where $dx$ means $dm(x)$ with $m$ Lebesgue measure. (Lebesgue measure assigns
each set its volume.) Similarly, there is a pdf $g$ for $M$. In this case,

$$D(P\|M) = \int_{\mathcal{R}^n} f(x) \ln \frac{f(x)}{g(x)}\,dx. \tag{5.5}$$
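
As a one-dimensional illustration of (5.5), the sketch below integrates $f \ln(f/g)$ numerically for two Gaussian pdfs and compares the result with the standard closed form for the divergence between two normal distributions. The specific parameters, the integration range, the midpoint rule, and all function names are assumptions made for the example; they do not come from the text.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def divergence_from_pdfs(f, g, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f ln(f/g) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        fx, gx = f(x), g(x)
        if fx > 0.0:
            total += fx * math.log(fx / gx) * dx
    return total

def gaussian_divergence(mu_p, sigma_p, mu_m, sigma_m):
    """Closed form for D(N(mu_p, sigma_p^2) || N(mu_m, sigma_m^2)) in nats."""
    return (math.log(sigma_m / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_m) ** 2) / (2.0 * sigma_m ** 2)
            - 0.5)

# P = N(0, 1) and M = N(1, 4); the range [-12, 12] carries essentially all of P.
numeric = divergence_from_pdfs(lambda x: normal_pdf(x, 0.0, 1.0),
                               lambda x: normal_pdf(x, 1.0, 2.0),
                               -12.0, 12.0)
exact = gaussian_divergence(0.0, 1.0, 1.0, 2.0)
```

The truncated range is adequate here only because both densities are strictly positive on it and the neglected tail mass of $P$ is negligible; other pairs of densities may need a different range or an adaptive rule.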

The following immediate corollary to the previous lemma provides a formula 
that is occasionally useful for computing divergences. 

Corollary 5.2.1: Given three probability distributions $M \gg Q \gg P$,
then

$$D(P\|M) = D(P\|Q) + E_P\left(\ln \frac{dQ}{dM}\right).$$

Proof: From the chain rule for Radon-Nikodym derivatives (e.g., Lemma
5.7.3 of [50])

$$\frac{dP}{dM} = \frac{dP}{dQ} \frac{dQ}{dM},$$

and taking expectations using the previous lemma yields the corollary. □
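
In the discrete case the Radon-Nikodym derivatives are ratios of pmfs, so the corollary can be checked directly. The following sketch uses three arbitrary strictly positive pmfs chosen for illustration; the names are hypothetical.

```python
import math

def divergence(p, m):
    """D(P||M) in nats; assumes m is strictly positive wherever p is."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0.0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
m = [0.2, 0.3, 0.5]

# E_P ln(dQ/dM): in the discrete case dQ/dM is the pmf ratio q/m.
correction = sum(pi * math.log(qi / mi) for pi, qi, mi in zip(p, q, m))

lhs = divergence(p, m)
rhs = divergence(p, q) + correction
```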

The next result is a technical result that shows that given a mapping on
a space, the divergence between the induced distributions can be computed
from the restrictions of the original measures to the sub-$\sigma$-field induced by
the mapping. As part of the result, the relation between the induced Radon-
Nikodym derivative and the original derivative is made explicit.

Recall that if $P$ is a probability measure on a measurable space $(\Omega, \mathcal{B})$ and
if $\mathcal{F}$ is a sub-$\sigma$-field of $\mathcal{B}$, then the restriction $P_{\mathcal{F}}$ of $P$ to $\mathcal{F}$ is the probability
measure on the measurable space $(\Omega, \mathcal{F})$ defined by $P_{\mathcal{F}}(G) = P(G)$ for all
$G \in \mathcal{F}$. In other words, we can use either the probability measures on the new
space or the restrictions of the probability measures on the old space to compute
the divergence. This motivates considering the properties of divergences of
restrictions of measures, a useful generality in that it simplifies proofs. The
following lemma can be viewed as a bookkeeping result relating the divergence
and the Radon-Nikodym derivatives in the two spaces.

Lemma 5.2.4: (a) Suppose that $M, P$ are two probability measures on
a space $(\Omega, \mathcal{B})$ and that $X$ is a measurement mapping this space into $(A, \mathcal{A})$.
Let $P_X$ and $M_X$ denote the induced distributions (measures on $(A, \mathcal{A})$) and let
$P_{\sigma(X)}$ and $M_{\sigma(X)}$ denote the restrictions of $P$ and $M$ to $\sigma(X)$, the sub-$\sigma$-field
of $\mathcal{B}$ generated by $X$. Then

$$D(P_X\|M_X) = D(P_{\sigma(X)}\|M_{\sigma(X)}).$$

If the Radon-Nikodym derivative $f = dP_X/dM_X$ exists (e.g., the above
divergence is finite), then define the function $f(X) : \Omega \to [0, \infty)$ by

$$f(X)(\omega) = f(X(\omega)) = \frac{dP_X}{dM_X}(X(\omega));$$

then with probability 1 under both $M$ and $P$

$$f(X) = \frac{dP_{\sigma(X)}}{dM_{\sigma(X)}}.$$

(b) Suppose that $P \ll M$. Then for any sub-$\sigma$-field $\mathcal{F}$ of $\mathcal{B}$, we have that

$$\frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}} = E_M\left(\frac{dP}{dM}\,\Big|\,\mathcal{F}\right).$$

Thus the Radon-Nikodym derivative for the restrictions is just the conditional
expectation of the original Radon-Nikodym derivative.

Proof: The proof is mostly algebra: $D(P_{\sigma(X)}\|M_{\sigma(X)})$ is the supremum over
all finite partitions $\mathcal{Q}$ with elements in $\sigma(X)$ of the relative entropy
$H_{P_{\sigma(X)}\|M_{\sigma(X)}}(\mathcal{Q})$. Each element $Q \in \mathcal{Q} \subset \sigma(X)$ corresponds to a unique set
$Q' \in \mathcal{A}$ via $Q = X^{-1}(Q')$ and hence to each $\mathcal{Q} \subset \sigma(X)$ there is a corresponding
partition $\mathcal{Q}' \subset \mathcal{A}$. The corresponding relative entropies are equal, however, since

$$H_{P_X\|M_X}(\mathcal{Q}') = \sum_{Q' \in \mathcal{Q}'} P_X(Q') \ln \frac{P_X(Q')}{M_X(Q')} = \sum_{Q' \in \mathcal{Q}'} P(X^{-1}(Q')) \ln \frac{P(X^{-1}(Q'))}{M(X^{-1}(Q'))}$$

$$= \sum_{Q \in \mathcal{Q}} P_{\sigma(X)}(Q) \ln \frac{P_{\sigma(X)}(Q)}{M_{\sigma(X)}(Q)} = H_{P_{\sigma(X)}\|M_{\sigma(X)}}(\mathcal{Q}).$$

Taking the supremum over the partitions proves that the divergences are equal.
If the derivative is $f = dP_X/dM_X$, then $f(X)$ is measurable since it is a
measurable function of a measurable function. In addition, it is measurable with
respect to $\sigma(X)$ since it depends on $\omega$ only through $X(\omega)$. For any $F \in \sigma(X)$
there is a $G \in \mathcal{A}$ such that $F = X^{-1}(G)$ and

$$\int_F f(X)\,dM_{\sigma(X)} = \int_F f(X)\,dM = \int_G f\,dM_X$$

from the change of variables formula (see, e.g., Lemma 4.4.7 of [50]). Thus

$$\int_F f(X)\,dM_{\sigma(X)} = P_X(G) = P_{\sigma(X)}(X^{-1}(G)) = P_{\sigma(X)}(F),$$

which proves that $f(X)$ is indeed the claimed derivative with probability 1 under
$M$ and hence also under $P$.

The variation quoted in part (b) is proved by direct verification using iterated
expectation. If $G \in \mathcal{F}$, then using iterated expectation we have that

$$\int_G E_M\left(\frac{dP}{dM}\,\Big|\,\mathcal{F}\right) dM_{\mathcal{F}} = \int E_M\left(1_G \frac{dP}{dM}\,\Big|\,\mathcal{F}\right) dM_{\mathcal{F}}.$$

Since the argument of the integrand is $\mathcal{F}$-measurable (see, e.g., Lemma 5.3.1 of
[50]), invoking iterated expectation (e.g., Corollary 5.9.3 of [50]) yields

$$\int_G E_M\left(\frac{dP}{dM}\,\Big|\,\mathcal{F}\right) dM_{\mathcal{F}} = \int E_M\left(1_G \frac{dP}{dM}\,\Big|\,\mathcal{F}\right) dM = E_M\left(1_G \frac{dP}{dM}\right) = P(G) = P_{\mathcal{F}}(G),$$

proving that the conditional expectation is the claimed derivative. □

Part (b) of the Lemma was pointed out to the author by Paul Algoet. 
Having argued above that restrictions of measures are useful when finding 
divergences of random variables, we provide a key trick for treating such restric- 
tions. 

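Lemma 5.2.5: Let $M \gg P$ be two measures on a space $(\Omega, \mathcal{B})$. Suppose
that $\mathcal{F}$ is a sub-$\sigma$-field and that $P_{\mathcal{F}}$ and $M_{\mathcal{F}}$ are the restrictions of $P$ and $M$ to
$\mathcal{F}$. Then there is a measure $S$ such that $M \gg S \gg P$ and

$$\frac{dP}{dS} = \frac{dP/dM}{dP_{\mathcal{F}}/dM_{\mathcal{F}}},$$

$$\frac{dS}{dM} = \frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}},$$

and

$$D(P\|S) + D(P_{\mathcal{F}}\|M_{\mathcal{F}}) = D(P\|M). \tag{5.6}$$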

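Proof: If $M \gg P$, then clearly $M_{\mathcal{F}} \gg P_{\mathcal{F}}$ and hence the appropriate
Radon-Nikodym derivatives exist. Define the set function $S$ by

$$S(F) = \int_F \frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}}\,dM = \int_F E_M\left(\frac{dP}{dM}\,\Big|\,\mathcal{F}\right) dM,$$

using part (b) of the previous lemma. Thus $M \gg S$ and $dS/dM = dP_{\mathcal{F}}/dM_{\mathcal{F}}$.
Observe that for $F \in \mathcal{F}$, iterated expectation implies that

$$S(F) = E_M\left(E_M\left(1_F \frac{dP}{dM}\,\Big|\,\mathcal{F}\right)\right) = E_M\left(1_F \frac{dP}{dM}\right) = P(F) = P_{\mathcal{F}}(F); \quad F \in \mathcal{F},$$

and hence in particular that $S(\Omega)$ is 1 so that $dP_{\mathcal{F}}/dM_{\mathcal{F}}$ is integrable and $S$ is
indeed a probability measure on $(\Omega, \mathcal{B})$. (In addition, the restriction of $S$ to $\mathcal{F}$
is just $P_{\mathcal{F}}$.) Define

$$g = \frac{dP/dM}{dP_{\mathcal{F}}/dM_{\mathcal{F}}}.$$

This is well defined since with $M$ probability 1, if the denominator is 0, then
so is the numerator. Given $F \in \mathcal{B}$ the Radon-Nikodym theorem (e.g., Theorem
5.6.1 of [50]) implies that

$$\int_F g\,dS = \int 1_F\, g\, \frac{dS}{dM}\,dM = \int 1_F \frac{dP/dM}{dP_{\mathcal{F}}/dM_{\mathcal{F}}} \frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}}\,dM = P(F),$$

that is, $P \ll S$ and

$$\frac{dP}{dS} = \frac{dP/dM}{dP_{\mathcal{F}}/dM_{\mathcal{F}}},$$

proving the first part of the lemma. The second part follows by direct
verification:

$$D(P\|M) = \int \ln \frac{dP}{dM}\,dP = \int \ln \frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}}\,dP + \int \ln \frac{dP/dM}{dP_{\mathcal{F}}/dM_{\mathcal{F}}}\,dP$$

$$= \int \ln \frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}}\,dP_{\mathcal{F}} + \int \ln \frac{dP}{dS}\,dP = D(P_{\mathcal{F}}\|M_{\mathcal{F}}) + D(P\|S).$$

□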

The two previous lemmas and the divergence inequality immediately yield
the following result for $M \gg P$. If $M$ does not dominate $P$, then the result is
trivial.

Corollary 5.2.2: Given two measures $M, P$ on a space $(\Omega, \mathcal{B})$ and a
sub-$\sigma$-field $\mathcal{F}$ of $\mathcal{B}$, then

$$D(P\|M) \ge D(P_{\mathcal{F}}\|M_{\mathcal{F}}).$$

If $f$ is a measurement on the given space, then

$$D(P\|M) \ge D(P_f\|M_f).$$

The result is obvious for finite fields $\mathcal{F}$ or finite alphabet measurements $f$
from the definition of divergence. The general result for arbitrary measurable
functions could also have been proved by combining the corresponding finite
alphabet result of Corollary 2.3.1 and an approximation technique. As above,
however, we will occasionally get results comparing the divergences of measures
and their restrictions by combining the trick of Lemma 5.2.5 with a result for a
single divergence.
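
The decomposition (5.6) and the inequality of Corollary 5.2.2 can be seen concretely on a finite sample space, where a sub-$\sigma$-field is generated by a partition and the restricted measures are determined by the atom probabilities. The sketch below is illustrative only: the six-point space, the pmfs, and the partition are assumptions chosen for the example. It builds the intermediate measure $S$ from its density $dS/dM = dP_{\mathcal{F}}/dM_{\mathcal{F}}$, which is constant on each atom.

```python
import math

def divergence(p, m):
    """D(P||M) in nats; assumes m is strictly positive wherever p is."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0.0)

# Six-point sample space; the sub-sigma-field is generated by this partition.
p = [0.10, 0.20, 0.05, 0.25, 0.30, 0.10]
m = [0.20, 0.10, 0.20, 0.10, 0.15, 0.25]
partition = [[0, 1], [2, 3], [4, 5]]

# The restrictions P_F and M_F are determined by the atom probabilities.
p_atoms = [sum(p[i] for i in atom) for atom in partition]
m_atoms = [sum(m[i] for i in atom) for atom in partition]

# S has density dS/dM = dP_F/dM_F, which is constant on each atom.
s = [0.0] * len(p)
for atom, pa, ma in zip(partition, p_atoms, m_atoms):
    for i in atom:
        s[i] = m[i] * pa / ma

d_full = divergence(p, m)                   # D(P||M)
d_restricted = divergence(p_atoms, m_atoms) # D(P_F||M_F)
d_ps = divergence(p, s)                     # D(P||S)
```

Since `d_ps` is nonnegative by the divergence inequality, the identity `d_ps + d_restricted == d_full` (up to rounding) gives `d_restricted <= d_full`, which is the argument behind Corollary 5.2.2 in miniature.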

The following corollary follows immediately from Lemma 5.2.2 since the
union of a sequence of asymptotically generating sub-$\sigma$-fields is a generating
field.

Corollary 5.2.3: Suppose that $M, P$ are probability measures on a
measurable space $(\Omega, \mathcal{B})$ and that $\mathcal{F}_n$ is an asymptotically generating sequence of
sub-$\sigma$-fields and let $P_n$ and $M_n$ denote the restrictions of $P$ and $M$ to $\mathcal{F}_n$ (e.g.,
$P_n = P_{\mathcal{F}_n}$). Then

$$D(P_n\|M_n) \uparrow D(P\|M).$$

There are two useful special cases of the above corollary which follow
immediately by specifying a particular sequence of increasing sub-$\sigma$-fields. The
following two corollaries give these results.

Corollary 5.2.4: Let $M, P$ be two probability measures on a measurable
space $(\Omega, \mathcal{B})$. Suppose that $f$ is an $A$-valued measurement on the space. Assume
that $q_n : A \to A_n$ is a sequence of measurable mappings into finite sets $A_n$ with
the property that the sequence of fields $\mathcal{F}_n = \mathcal{F}(q_n(f))$ generated by the sets
$\{(q_n(f))^{-1}(a);\ a \in A_n\}$ asymptotically generate $\sigma(f)$. (For example, if the original
space is standard let $\mathcal{F}_n$ be a basis and let $q_n$ map the points in the $i$th atom
of $\mathcal{F}_n$ into $i$.) Then

$$D(P_f\|M_f) = \lim_{n\to\infty} D(P_{q_n(f)}\|M_{q_n(f)}).$$

The corollary states that the divergence between two distributions of a
random variable can be found as a limit of quantized versions of the random
variable. Note that the limit could also be written as

$$\lim_{n\to\infty} H_{P_f\|M_f}(q_n).$$
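
The quantizer limit can be observed numerically. The sketch below is a hedged illustration, not the construction of Section 1.6: it uses a slight variant of a uniform quantizer (cells of width $2^{-n}$ on $[-n, n)$ plus two overload cells), takes $P = N(0,1)$ and $M = N(1,4)$ as assumed example distributions, and computes cell probabilities from the normal cdf via `math.erf`. Because each partition refines the previous one, the quantized divergences should be nondecreasing and approach the closed-form Gaussian value $\ln 2 - 1/4$.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def quantized_divergence(n, p_params, m_params):
    """Divergence between the cell-probability vectors induced by a quantizer
    with cells of width 2**-n on [-n, n) and two overload cells."""
    inner = [k * 2.0 ** -n for k in range(-n * 2 ** n, n * 2 ** n + 1)]
    edges = [-math.inf] + inner + [math.inf]
    total = 0.0
    for lo, hi in zip(edges, edges[1:]):
        pc = normal_cdf(hi, *p_params) - normal_cdf(lo, *p_params)
        mc = normal_cdf(hi, *m_params) - normal_cdf(lo, *m_params)
        if pc > 0.0:
            total += pc * math.log(pc / mc)
    return total

# (mean, standard deviation) pairs for P and M.
values = [quantized_divergence(n, (0.0, 1.0), (1.0, 2.0)) for n in range(1, 6)]
limit = math.log(2.0) - 0.25   # closed-form D(N(0,1) || N(1,4)) in nats
```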

In the next corollary we consider increasing sequences of random variables 
instead of increasing sequences of quantizers, that is, more random variables 
(which need not be finite alphabet) instead of ever finer quantizers. The corol- 
lary follows immediately from Corollary 5.2.3 and Lemma 5.2.4. 

Corollary 5.2.5: Suppose that $M$ and $P$ are measures on the sequence
space corresponding to outcomes of a sequence of random variables $X_0, X_1, \cdots$
with alphabet $A$. Let $\mathcal{F}_n = \sigma(X_0, \cdots, X_{n-1})$, which asymptotically generates
the $\sigma$-field $\sigma(X_0, X_1, \cdots)$. Then

$$\lim_{n\to\infty} D(P_{X^n}\|M_{X^n}) = D(P\|M).$$

We now develop two fundamental inequalities involving entropy densities 
and divergence. The first inequality is from Pinsker [125]. The second is an 
improvement of an inequality of Pinsker [125] by Csiszar [24] and Kullback [91]. 
The second inequality is more useful when the divergence is small. Coupling 
these inequalities with the trick of Lemma 5.2.5 provides a simple generalization 
of an inequality of [48] and will provide easy proofs of $L^1$ convergence results
for entropy and information densities. A key step in the proof involves a notion
of distance between probability measures and is of interest in its own right.

Given two probability measures $M, P$ on a common measurable space $(\Omega, \mathcal{B})$,
the variational distance between them is defined by

$$d(P, M) = \sup_{\mathcal{Q}} \sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)|,$$

where the supremum is over all finite measurable partitions. We will proceed by 
stating first the end goal, the two inequalities involving divergence, as a lemma, 
and then state two lemmas giving the basic required properties of the variational 
distance. The lemmas will be proved in a different order. 

Lemma 5.2.6: Let $P$ and $M$ be two measures on a common probability
space $(\Omega, \mathcal{B})$ with $P \ll M$. Let $f = dP/dM$ be the Radon-Nikodym derivative
and let $h = \ln f$ be the entropy density. Then

$$D(P\|M) \le \int |h|\,dP \le D(P\|M) + \frac{2}{e}, \tag{5.7}$$

$$\int |h|\,dP \le D(P\|M) + \sqrt{2D(P\|M)}. \tag{5.8}$$

Lemma 5.2.7: Given two probability measures $M, P$ on a common
measurable space $(\Omega, \mathcal{B})$, the variational distance is given by

$$d(P, M) = 2 \sup_{F \in \mathcal{B}} |P(F) - M(F)|. \tag{5.9}$$

Furthermore, if $S$ is a measure for which $P \ll S$ and $M \ll S$ ($S = (P + M)/2$,
for example), then also

$$d(P, M) = \int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS \tag{5.10}$$

and the supremum in (5.9) is achieved by the set

$$F = \left\{\omega : \frac{dP}{dS}(\omega) > \frac{dM}{dS}(\omega)\right\}.$$

Lemma 5.2.8:

$$d(P, M) \le \sqrt{2D(P\|M)}.$$
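
On a finite sample space all three lemmas can be checked by direct computation. The sketch below is an illustration under stated assumptions: the six-point pmfs are arbitrary strictly positive examples, and the variational distance is evaluated as the $L^1$ distance between the pmfs, which is what the supremum over partitions reduces to when the finest partition is available. The brute-force search over all events is only feasible because the space is tiny.

```python
import math
from itertools import combinations

p = [0.10, 0.20, 0.05, 0.25, 0.30, 0.10]
m = [0.20, 0.10, 0.20, 0.10, 0.15, 0.25]

# d(P, M): for pmfs on a finite set the finest partition attains the supremum.
d_var = sum(abs(pi - mi) for pi, mi in zip(p, m))

# sup_F |P(F) - M(F)| by exhaustive search over all 2^6 events, as in (5.9).
points = range(len(p))
sup_events = max(
    abs(sum(p[i] - m[i] for i in event))
    for r in range(len(p) + 1)
    for event in combinations(points, r)
)

# Divergence and the mean absolute entropy density of Lemma 5.2.6.
d_pm = sum(pi * math.log(pi / mi) for pi, mi in zip(p, m))
mean_abs_h = sum(pi * abs(math.log(pi / mi)) for pi, mi in zip(p, m))
```

For these particular pmfs the bound of Lemma 5.2.8 is fairly tight: `d_var` is 0.8 while `math.sqrt(2 * d_pm)` is roughly 0.83.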

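Proof of Lemma 5.2.7: First observe that for any set $F$ we have for the
partition $\mathcal{Q} = \{F, F^c\}$ that

$$d(P, M) \ge \sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| = 2|P(F) - M(F)|$$

and hence

$$d(P, M) \ge 2 \sup_{F \in \mathcal{B}} |P(F) - M(F)|.$$

Conversely, suppose that $\mathcal{Q}$ is a partition which approximately yields the
variational distance, e.g.,

$$\sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| \ge d(P, M) - \epsilon$$

for $\epsilon > 0$. Define a set $F$ as the union of all of the $Q$ in $\mathcal{Q}$ for which $P(Q) \ge M(Q)$
and we have that

$$\sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| = P(F) - M(F) + M(F^c) - P(F^c) = 2(P(F) - M(F))$$

and hence

$$d(P, M) - \epsilon \le \sup_{F \in \mathcal{B}} 2|P(F) - M(F)|.$$

Since $\epsilon$ is arbitrary, this proves the first statement of the lemma.

Next suppose that a measure $S$ dominating both $P$ and $M$ exists and define
the set

$$F = \left\{\omega : \frac{dP}{dS}(\omega) > \frac{dM}{dS}(\omega)\right\}$$

and observe that

$$\int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS = \int_F \left(\frac{dP}{dS} - \frac{dM}{dS}\right) dS - \int_{F^c} \left(\frac{dP}{dS} - \frac{dM}{dS}\right) dS$$

$$= P(F) - M(F) - (P(F^c) - M(F^c)) = 2(P(F) - M(F)).$$

From the definition of $F$, however,

$$P(F) = \int_F \frac{dP}{dS}\,dS \ge \int_F \frac{dM}{dS}\,dS = M(F)$$

so that $P(F) - M(F) = |P(F) - M(F)|$. Thus we have that

$$\int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS = 2|P(F) - M(F)| \le 2 \sup_{G \in \mathcal{B}} |P(G) - M(G)| = d(P, M).$$

To prove the reverse inequality, assume that $\mathcal{Q}$ approximately yields the
variational distance, that is, for $\epsilon > 0$ we have

$$\sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| \ge d(P, M) - \epsilon.$$

Then

$$\sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| = \sum_{Q \in \mathcal{Q}} \left| \int_Q \left(\frac{dP}{dS} - \frac{dM}{dS}\right) dS \right| \le \sum_{Q \in \mathcal{Q}} \int_Q \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS = \int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS,$$

which, since $\epsilon$ is arbitrary, proves that

$$d(P, M) \le \int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS.$$

Combining this with the earlier inequality proves (5.10). We have already seen
that this upper bound is actually achieved with the given choice of $F$, which
completes the proof of the lemma. □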

Proof of Lemma 5.2.8: Assume that $M \gg P$ since the result is trivial
otherwise because the right-hand side is infinite. The inequality will follow
from the first statement of Lemma 5.2.7 and the following inequality: Given
$1 \ge p, m \ge 0$,

$$p \ln \frac{p}{m} + (1 - p) \ln \frac{1 - p}{1 - m} - 2(p - m)^2 \ge 0. \tag{5.11}$$

To see this, suppose the truth of (5.11). Since $F$ can be chosen so that $2(P(F) -
M(F))$ is arbitrarily close to $d(P, M)$, given $\epsilon > 0$ choose a set $F$ such that
$[2(P(F) - M(F))]^2 \ge d(P, M)^2 - 2\epsilon$. Since $\{F, F^c\}$ is a partition,

$$D(P\|M) - \frac{d(P, M)^2}{2} \ge P(F) \ln \frac{P(F)}{M(F)} + (1 - P(F)) \ln \frac{1 - P(F)}{1 - M(F)} - 2(P(F) - M(F))^2 - \epsilon.$$

If (5.11) holds, then the right-hand side is bounded below by $-\epsilon$, which proves
the lemma since $\epsilon$ is arbitrarily small. To prove (5.11) observe that the
left-hand side equals zero for $p = m$, has a negative derivative with respect to $m$
for $m < p$, and has a positive derivative with respect to $m$ for $m > p$. (The
derivative with respect to $m$ is $(m - p)[1 - 4m(1 - m)]/[m(1 - m)]$.) Thus the
left-hand side of (5.11) decreases to its minimum value of 0 as $m$ tends to $p$ from
above or below. □

Proof of Lemma 5.2.6: The magnitude of the entropy density can be written as

$$|h(\omega)| = h(\omega) + 2h(\omega)^- \tag{5.12}$$

where $a^- = -\min(a, 0)$. This identity immediately gives the trivial left-hand
inequality of (5.7). The right-hand inequality follows from the fact that

$$\int h^-\,dP = \int f [\ln f]^-\,dM$$

and the elementary inequality $a \ln a \ge -1/e$.

The second inequality will follow from (5.12) if we can show that

$$2 \int h^-\,dP \le \sqrt{2D(P\|M)}.$$

Let $F$ denote the set $\{h \le 0\}$ and we have from (5.4) that

$$2 \int h^-\,dP = -2 \int_F h\,dP \le -2P(F) \ln \frac{P(F)}{M(F)}$$

and hence using the inequality $\ln x \le x - 1$ and Lemmas 5.2.7 and 5.2.8

$$2 \int h^-\,dP \le 2P(F) \ln \frac{M(F)}{P(F)} \le 2(M(F) - P(F)) \le d(P, M) \le \sqrt{2D(P\|M)},$$

completing the proof. □

Combining Lemmas 5.2.6 and 5.2.5 yields the following corollary, which
generalizes Lemma 2 of [54]:

Corollary 5.2.6: Let $P$ and $M$ be two measures on a space $(\Omega, \mathcal{B})$. Suppose
that $\mathcal{F}$ is a sub-$\sigma$-field and that $P_{\mathcal{F}}$ and $M_{\mathcal{F}}$ are the restrictions of $P$ and $M$
to $\mathcal{F}$. Assume that $M \gg P$. Define the entropy densities $h = \ln dP/dM$ and
$h' = \ln dP_{\mathcal{F}}/dM_{\mathcal{F}}$. Then

$$\int |h - h'|\,dP \le D(P\|M) - D(P_{\mathcal{F}}\|M_{\mathcal{F}}) + \frac{2}{e}, \tag{5.13}$$

and

$$\int |h - h'|\,dP \le D(P\|M) - D(P_{\mathcal{F}}\|M_{\mathcal{F}}) + \sqrt{2D(P\|M) - 2D(P_{\mathcal{F}}\|M_{\mathcal{F}})}. \tag{5.14}$$

Proof: Choose the measure $S$ as in Lemma 5.2.5 and then apply Lemma
5.2.6 with $S$ replacing $M$. □

Variational Description of Divergence

As in the discrete case, divergence has a variational characterization that is a 
fundamental property for its applications to large deviations theory [143] [31]. 
We again take a detour to state and prove the property without delving into its 
applications. 

Suppose now that $P$ and $M$ are two probability measures on a common
probability space, say $(\Omega, \mathcal{B})$, such that $M \gg P$ and hence the density

$$f = \frac{dP}{dM}$$

is well defined. Suppose that $\Phi$ is a real-valued random variable defined on the
same space, which we explicitly require to be finite-valued (it cannot assume $\infty$
as a value) and to have finite cumulant generating function:

$$E_M(e^{\Phi}) < \infty.$$

Then we can define a probability measure $M^{\Phi}$ by

$$M^{\Phi}(F) = \int_F \frac{e^{\Phi}}{E_M(e^{\Phi})}\,dM \tag{5.15}$$

and observe immediately that by construction $M \gg M^{\Phi}$ and

$$\frac{dM^{\Phi}}{dM} = \frac{e^{\Phi}}{E_M(e^{\Phi})}.$$

The measure $M^{\Phi}$ is called a "tilted" distribution. Furthermore, by construction
$dM^{\Phi}/dM \ne 0$ and hence we can write

$$\int_F \frac{f}{e^{\Phi}/E_M(e^{\Phi})}\,dM^{\Phi} = \int_F \frac{f}{e^{\Phi}/E_M(e^{\Phi})} \frac{dM^{\Phi}}{dM}\,dM = \int_F f\,dM = P(F)$$

and hence $P \ll M^{\Phi}$ and

$$\frac{dP}{dM^{\Phi}} = \frac{f}{e^{\Phi}/E_M(e^{\Phi})}.$$

We are now ready to state and prove the principal result of this section, a
variational characterization of divergence.

Theorem 5.2.1: Suppose that $M \gg P$. Then

$$D(P\|M) = \sup_{\Phi} \left(E_P \Phi - \ln(E_M(e^{\Phi}))\right), \tag{5.16}$$

where the supremum is over all random variables $\Phi$ for which $\Phi$ is finite-valued
and $e^{\Phi}$ is $M$-integrable.

Proof: First consider the random variable $\Phi$ defined by $\Phi = \ln f$ and observe
that

$$E_P \Phi - \ln(E_M(e^{\Phi})) = \int dP \ln f - \ln\left(\int dM\, f\right) = D(P\|M) - \ln \int dP = D(P\|M).$$

This proves that the supremum over all $\Phi$ is no smaller than the divergence. To
prove the other half observe that for any $\Phi$,

$$D(P\|M) - \left(E_P \Phi - \ln E_M(e^{\Phi})\right) = E_P\left(\ln \frac{dP/dM}{dM^{\Phi}/dM}\right),$$

where $M^{\Phi}$ is the tilted distribution constructed above. Since $M \gg M^{\Phi} \gg P$,
we have from the chain rule for Radon-Nikodym derivatives that

$$D(P\|M) - \left(E_P \Phi - \ln E_M(e^{\Phi})\right) = E_P \ln \frac{dP}{dM^{\Phi}} = D(P\|M^{\Phi}) \ge 0$$

from the divergence inequality, which completes the proof. Note that equality
holds and the supremum is achieved if and only if $M^{\Phi} = P$. □
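
The variational formula (5.16) is easy to probe on a finite sample space, where $\Phi$ is just a vector of real numbers. The sketch below is illustrative only: the pmfs are arbitrary strictly positive examples (so that $\ln f$ is finite-valued), the function name `dv_objective` is hypothetical, and the random search is a sanity check, not an optimization method. It evaluates the objective at $\Phi = \ln f$ and at randomly drawn $\Phi$.

```python
import math
import random

def dv_objective(phi, p, m):
    """E_P[Phi] - ln E_M[exp(Phi)] for a function Phi on a finite sample space."""
    mean_p = sum(pi * ph for pi, ph in zip(p, phi))
    log_mgf = math.log(sum(mi * math.exp(ph) for mi, ph in zip(m, phi)))
    return mean_p - log_mgf

p = [0.5, 0.3, 0.2]
m = [0.2, 0.3, 0.5]
d_pm = sum(pi * math.log(pi / mi) for pi, mi in zip(p, m))

# The maximizing choice Phi = ln f, with f = dP/dM = p/m in the discrete case.
phi_star = [math.log(pi / mi) for pi, mi in zip(p, m)]
at_optimum = dv_objective(phi_star, p, m)

# Any other finite-valued Phi should give a value no larger than D(P||M).
rng = random.Random(0)
trials = [dv_objective([rng.uniform(-3.0, 3.0) for _ in p], p, m)
          for _ in range(1000)]
```

Adding a constant to $\Phi$ leaves the objective unchanged, which matches the fact that the tilted distribution $M^{\Phi}$ does not depend on such a shift.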



5.3 Conditional Relative Entropy 



Lemmas 5.2.4 and 5.2.5 combine with basic properties of conditional probability 
in standard spaces to provide an alternative form of Lemma 5.2.5 in terms of 
random variables that gives an interesting connection between the densities for 
combinations of random variables and those for individual random variables. 
The results are collected in Theorem 5.3.1. First, however, several definitions 
are required. Let $X$ and $Y$ be random variables with standard alphabets $A_X$
and $A_Y$ and $\sigma$-fields $\mathcal{B}_{A_X}$ and $\mathcal{B}_{A_Y}$, respectively. Let $P_{XY}$ and $M_{XY}$ be two
distributions on $(A_X \times A_Y, \mathcal{B}_{A_X \times A_Y})$ and assume that $M_{XY} \gg P_{XY}$. Let $M_Y$
and $P_Y$ denote the induced marginal distributions, e.g., $M_Y(F) = M_{XY}(A_X \times
F)$. Define the (nonnegative) densities (Radon-Nikodym derivatives):

$$f_{XY} = \frac{dP_{XY}}{dM_{XY}}, \quad f_Y = \frac{dP_Y}{dM_Y},$$

so that

$$P_{XY}(F) = \int_F f_{XY}\,dM_{XY}; \quad F \in \mathcal{B}_{A_X \times A_Y},$$

$$P_Y(F) = \int_F f_Y\,dM_Y; \quad F \in \mathcal{B}_{A_Y}.$$

Note that $M_{XY} \gg P_{XY}$ implies that $M_Y \gg P_Y$ and hence $f_Y$ is well defined
if $f_{XY}$ is. Define also the conditional density

$$f_{X|Y}(x|y) = \begin{cases} \dfrac{f_{XY}(x, y)}{f_Y(y)}, & \text{if } f_Y(y) > 0 \\ 1, & \text{otherwise.} \end{cases}$$

Suppose now that the entropy density

$$h_Y = \ln f_Y$$

exists and define the conditional entropy density or conditional relative entropy
density by

$$h_{X|Y} = \ln f_{X|Y}.$$

Again supposing that these densities exist, we (tentatively) define the conditional
relative entropy

$$H_{P\|M}(X|Y) = E \ln f_{X|Y} = \int dP_{XY}(x, y) \ln f_{X|Y}(x|y) = \int dM_{XY}(x, y)\, f_{XY}(x, y) \ln f_{X|Y}(x|y) \tag{5.17}$$

if the expectation exists. Note that unlike unconditional relative entropies, 
the above definition of conditional relative entropy requires the existence of 
densities. Although this is sufficient in many of the applications and is con- 
venient for the moment, it is not sufficiently general to handle all the cases 
we will encounter. In particular, there will be situations where we wish to
define a conditional relative entropy $H_{P\|M}(X|Y)$ even though it is not true that
$M_{XY} \gg P_{XY}$. Hence at the end of this section we will return to this
question and provide a general definition that agrees with the current one when the
appropriate densities exist and that shares those properties not requiring the 
existence of densities, e.g., the chain rule for relative entropy. An alternative 
approach to a general definition for conditional relative entropy can be found in 
Algoet [6]. 

The previous construction immediately yields the following lemma providing 
chain rules for densities and relative entropies. 

Lemma 5.3.1:

$$f_{XY} = f_{X|Y} f_Y,$$

$$h_{XY} = h_{X|Y} + h_Y,$$

and hence

$$D(P_{XY}\|M_{XY}) = H_{P\|M}(X|Y) + D(P_Y\|M_Y), \tag{5.18}$$

or, equivalently,

$$H_{P\|M}(X, Y) = H_{P\|M}(Y) + H_{P\|M}(X|Y), \tag{5.19}$$

a chain rule for relative entropy analogous to that for ordinary entropy. Thus if
$H_{P\|M}(Y) < \infty$ so that the indeterminate form $\infty - \infty$ is avoided, then

$$H_{P\|M}(X|Y) = H_{P\|M}(X, Y) - H_{P\|M}(Y).$$
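
For finite alphabets the densities are ratios of pmfs and the chain rule (5.18) can be verified directly. The following sketch is an illustration under assumptions: the two joint pmfs are arbitrary strictly positive tables indexed as `table[x][y]`, and the helper names are hypothetical. It follows the definitions above, including the convention $f_{X|Y} = 1$ where $f_Y = 0$.

```python
import math

# Joint pmfs indexed as table[x][y]; every entry of m_xy is positive.
p_xy = [[0.10, 0.20], [0.30, 0.15], [0.05, 0.20]]
m_xy = [[0.20, 0.10], [0.10, 0.25], [0.20, 0.15]]

xs, ys = range(3), range(2)
p_y = [sum(p_xy[x][y] for x in xs) for y in ys]   # marginal of Y under P
m_y = [sum(m_xy[x][y] for x in xs) for y in ys]   # marginal of Y under M

def f_xy(x, y):
    return p_xy[x][y] / m_xy[x][y]

def f_y(y):
    return p_y[y] / m_y[y]

def f_x_given_y(x, y):
    # Conditional density, set to 1 where f_Y vanishes as in the definition.
    return f_xy(x, y) / f_y(y) if f_y(y) > 0.0 else 1.0

d_joint = sum(p_xy[x][y] * math.log(f_xy(x, y)) for x in xs for y in ys)
d_marginal = sum(p_y[y] * math.log(f_y(y)) for y in ys)
h_cond = sum(p_xy[x][y] * math.log(f_x_given_y(x, y)) for x in xs for y in ys)
```

Here `d_joint` plays the role of $D(P_{XY}\|M_{XY})$, `d_marginal` of $D(P_Y\|M_Y)$, and `h_cond` of $H_{P\|M}(X|Y)$.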

Since the alphabets are standard, there is a regular version of the conditional
probabilities of $X$ given $Y$ under the distribution $M_{XY}$; that is, for each $y \in A_Y$
there is a probability measure $M_{X|Y}(F|y)$; $F \in \mathcal{B}_{A_X}$; for fixed $F \in \mathcal{B}_{A_X}$,
$M_{X|Y}(F|y)$ is a measurable function of $y$ and such that for all $G \in \mathcal{B}_{A_Y}$

$$M_{XY}(F \times G) = E(1_G(Y) M_{X|Y}(F|Y)) = \int_G M_{X|Y}(F|y)\,dM_Y(y).$$

Lemma 5.3.2: Given the previous definitions, and writing $A = A_X$ and
$B = A_Y$ for brevity, define the set $\bar{B} \in \mathcal{B}_B$ to be the set of $y$ for which

$$\int_A f_{X|Y}(x|y)\,dM_{X|Y}(x|y) = 1.$$

Define $P_{X|Y}$ for $y \in \bar{B}$ by

$$P_{X|Y}(F|y) = \int_F f_{X|Y}(x|y)\,dM_{X|Y}(x|y); \quad F \in \mathcal{B}_A$$

and let $P_{X|Y}(\cdot|y)$ be an arbitrary fixed probability measure on $(A, \mathcal{B}_A)$ for all
$y \notin \bar{B}$. Then $M_Y(\bar{B}) = 1$, $P_{X|Y}$ is a regular conditional probability for $X$ given
$Y$ under the distribution $P_{XY}$, and

$$P_{X|Y} \ll M_{X|Y}; \quad M_Y\text{-a.e.},$$

that is, $M_Y(\{y : P_{X|Y}(\cdot|y) \ll M_{X|Y}(\cdot|y)\}) = 1$. Thus if $P_{XY} \ll M_{XY}$, we can
choose regular conditional probabilities under both distributions so that with
probability one under $M_Y$ the conditional probabilities under $P$ are dominated
by those under $M$ and

$$\frac{dP_{X|Y}}{dM_{X|Y}}(x|y) \equiv \frac{dP_{X|Y}(\cdot|y)}{dM_{X|Y}(\cdot|y)}(x) = f_{X|Y}(x|y); \quad x \in A.$$

Proof: Define for each y £ B the set function 



G y (F ) = j f x\y(x\y)dMx\y(x\y); F £ B A - 
Jf 



We shall show that G y (F ), y £ B, F £ Ba is a version of a regular conditional 
probability of X given Y under P XY - First observe using iterated expecta- 
tion and the fact that conditional expectations are expectations with respect to 
conditional probability measures ([50], Section 5.9) that for any F £ Bb 



[ [f fx\Y(x\y)dMx\y(x\y)]dMy(y) = E(l F (Y)E[l A (X)fx\ Y \Y]) 

Jf J a 

= E(l F (Y)lA(X)^jA-lf Y>Q ) = J lAxFj^l{f Y >0}fx Y dMxY 

= Lr P M,r>0> dPxY = /, ^ L T dPr ’ 

where the last step follows since since the function being integrated depends only 
on Y and hence is measurable with respect to <r(Y) and therefore its expectation 
can be computed from the restriction of P XY to cr(Y) (see, for example, Lemma 
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5.3.1 of [50]) and since Py(/y > 0) = 1. We can compute this last expectation, 
however, using My as 

f ~7~dP Y = [ f Y dM Y = [ dM Y = M Y (F) 

JF JY JF JY JF 

which yields finally that 

$$\int_F \left[ \int_A f_{X|Y}(x|y)\, dM_{X|Y}(x|y) \right] dM_Y(y) = M_Y(F); \quad \text{all } F \in \mathcal{B}_B.$$

If

$$\int_F g(y)\, dM_Y(y) = \int_F 1\, dM_Y(y), \quad \text{all } F \in \mathcal{B}_B,$$

however, it must also be true that $g = 1$ $M_Y$-a.e. (See, for example, Corollary 5.3.1 of [50].) Thus we have $M_Y$-a.e. and hence also $P_Y$-a.e. that

$$\int_A f_{X|Y}(x|y)\, dM_{X|Y}(x|y) = 1;$$

that is, $M_Y(B) = 1$. For $y \in B$, it follows from the basic properties of integration that $G_y$ is a probability measure on $(A, \mathcal{B}_A)$ (see Corollary 4.4.3 of [50]).

By construction, $P_{X|Y}(\cdot|y) \ll M_{X|Y}(\cdot|y)$ for all $y \in B$ and hence this is true with probability 1 under $M_Y$ and $P_Y$. Furthermore, by construction

$$\frac{dP_{X|Y}(\cdot|y)}{dM_{X|Y}(\cdot|y)}(x) = f_{X|Y}(x|y).$$



To complete the proof we need only show that $P_{X|Y}$ is indeed a version of the conditional probability of X given Y under $P_{XY}$. To do this, fix $G \in \mathcal{B}_A$ and observe for any $F \in \mathcal{B}_B$ that

$$\int_F P_{X|Y}(G|y)\, dP_Y(y) = \int_F \left[ \int_G f_{X|Y}(x|y)\, dM_{X|Y}(x|y) \right] dP_Y(y)$$

$$= \int_F \left[ \int_G f_{X|Y}(x|y)\, dM_{X|Y}(x|y) \right] f_Y(y)\, dM_Y(y)$$

$$= E\left[1_F(Y) f_Y E[1_G(X) f_{X|Y} | Y]\right] = E_M[1_{G \times F} f_{XY}],$$

again using iterated expectation. This immediately yields

$$\int_F P_{X|Y}(G|y)\, dP_Y(y) = \int_{G \times F} f_{XY}\, dM_{XY} = \int_{G \times F} dP_{XY} = P_{XY}(G \times F),$$

which proves that $P_{X|Y}(G|y)$ is a version of the conditional probability of X given Y under $P_{XY}$, thereby completing the proof. □

Theorem 5.3.1: Given the previous definitions with $M_{XY} \gg P_{XY}$, define the distribution $S_{XY}$ by

$$S_{XY}(F \times G) = \int_G M_{X|Y}(F|y)\, dP_Y(y), \tag{5.20}$$

that is, $S_{XY}$ has $P_Y$ as marginal distribution for Y and $M_{X|Y}$ as the conditional distribution of X given Y. Then the following statements are true:

1. $M_{XY} \gg S_{XY} \gg P_{XY}$.

2. $dS_{XY}/dM_{XY} = f_Y$ and $dP_{XY}/dS_{XY} = f_{X|Y}$.

3. $D(P_{XY}\|M_{XY}) = D(P_Y\|M_Y) + D(P_{XY}\|S_{XY})$, and hence $D(P_{XY}\|M_{XY})$ exceeds $D(P_Y\|M_Y)$ by an amount $D(P_{XY}\|S_{XY}) = H_{P\|M}(X|Y)$.

Proof: To apply Lemma 5.2.5 define $P = P_{XY}$, $M = M_{XY}$, $\mathcal{F} = \sigma(Y)$, $P' = P_{\sigma(Y)}$, and $M' = M_{\sigma(Y)}$. Define S by

$$S(F \times G) = \int_{F \times G} \frac{dP_{\sigma(Y)}}{dM_{\sigma(Y)}}\, dM_{XY},$$

for $F \in \mathcal{B}_A$ and $G \in \mathcal{B}_B$. We begin by showing that $S = S_{XY}$. All of the properties will then follow from Lemma 5.2.5.

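For $F \in \mathcal{B}_{A_X}$ and $G \in \mathcal{B}_{A_Y}$

$$S(F \times G) = E\left(1_{F \times G} \frac{dP_{\sigma(Y)}}{dM_{\sigma(Y)}}\right),$$

where the expectation is with respect to $M_{XY}$. Using Lemma 5.2.4 and iterated conditional expectation (cf. Corollary 5.9.3 of [50]) yields

$$E\left(1_{F \times G} \frac{dP_{\sigma(Y)}}{dM_{\sigma(Y)}}\right) = E\left(1_F(X) 1_G(Y) \frac{dP_Y}{dM_Y}(Y)\right)$$

$$= E\left(1_G(Y) \frac{dP_Y}{dM_Y}(Y) E[1_F(X)|Y]\right) = E\left(1_G(Y) \frac{dP_Y}{dM_Y}(Y) M_{X|Y}(F|Y)\right)$$

$$= \int_G M_{X|Y}(F|y) \frac{dP_Y}{dM_Y}(y)\, dM_Y(y) = \int_G M_{X|Y}(F|y)\, dP_Y(y),$$

proving that $S = S_{XY}$. Thus Lemma 5.2.5 implies that $M_{XY} \gg S_{XY} \gg P_{XY}$, proving the first property.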

From Lemma 5.2.4, $dP'/dM' = dP_{\sigma(Y)}/dM_{\sigma(Y)} = dP_Y/dM_Y = f_Y$, proving the first equality of property 2. This fact and the first property imply the second equality of property 2 from the chain rule of Radon-Nikodym derivatives. (See, e.g., Lemma 5.7.3 of [50].) Alternatively, the second equality of the second property follows from Lemma 5.2.5 since

$$\frac{dP_{XY}}{dS_{XY}} = \frac{dP_{XY}/dM_{XY}}{dS_{XY}/dM_{XY}} = \frac{f_{XY}}{f_Y}.$$

Corollary 5.2.1 therefore implies that $D(P_{XY}\|M_{XY}) = D(P_{XY}\|S_{XY}) + D(S_{XY}\|M_{XY})$, which with Property 2, Lemma 5.2.3, and the definition of relative entropy rate imply Property 3. □
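
As an informal check of Theorem 5.3.1, the following sketch evaluates the three divergences for finite alphabets, where every density is a ratio of probability mass functions. The two joint pmfs `P` and `M` are hypothetical values chosen only for illustration, and the helper names (`kl`, `marginal_y`) are not from the text.

```python
import math

def kl(p, m):
    """Divergence D(p||m) in nats for pmfs given as dicts on the same finite set."""
    return sum(pv * math.log(pv / m[k]) for k, pv in p.items() if pv > 0)

def marginal_y(joint):
    """Y marginal of a joint pmf indexed by (x, y)."""
    out = {}
    for (x, y), v in joint.items():
        out[y] = out.get(y, 0.0) + v
    return out

# Hypothetical joint pmfs P_XY and M_XY; M_XY is strictly positive, so M_XY >> P_XY.
P = {(0, 0): 0.20, (0, 1): 0.05, (0, 2): 0.15,
     (1, 0): 0.10, (1, 1): 0.30, (1, 2): 0.20}
M = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
     (1, 0): 0.25, (1, 1): 0.15, (1, 2): 0.20}

P_Y, M_Y = marginal_y(P), marginal_y(M)

# Finite-alphabet form of (5.20): S_XY(x, y) = M_{X|Y}(x|y) P_Y(y).
S = {(x, y): M[(x, y)] / M_Y[y] * P_Y[y] for (x, y) in M}

d_joint = kl(P, M)            # D(P_XY || M_XY)
d_marginal = kl(P_Y, M_Y)     # D(P_Y || M_Y)
d_conditional = kl(P, S)      # D(P_XY || S_XY), i.e. H_{P||M}(X|Y)

print(d_joint, d_marginal + d_conditional)
```

The two printed numbers agree up to rounding, which is Property 3; the ratio `S[k] / M[k]` equals `P_Y[y] / M_Y[y]`, which is the first equality of Property 2.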







It should be observed that it is not necessarily true that $D(P_{XY}\|S_{XY}) \ge D(P_X\|M_X)$ and hence that $D(P_{XY}\|M_{XY}) \ge D(P_X\|M_X) + D(P_Y\|M_Y)$ as one might expect since in general $S_X \ne M_X$. These formulas will, however, be true in the special case where $M_{XY} = M_X \times M_Y$.

We next turn to an extension and elaboration of the theorem when there 
are three random variables instead of two. This will be a crucial generalization 
for our later considerations of processes, when the three random variables will 
be replaced by the current output, a finite number of previous outputs, and the 
infinite past. 

Suppose that $M_{XYZ} \gg P_{XYZ}$ are two distributions for three standard alphabet random variables X, Y, and Z taking values in measurable spaces $(A_X, \mathcal{B}_{A_X})$, $(A_Y, \mathcal{B}_{A_Y})$, $(A_Z, \mathcal{B}_{A_Z})$, respectively. Observe that the absolute continuity implies absolute continuity for the restrictions, e.g., $M_{XY} \gg P_{XY}$ and $M_Y \gg P_Y$. Define the Radon-Nikodym derivatives $f_{XYZ}$, $f_{YZ}$, $f_Y$, etc. in the obvious way; for example,

$$f_{XYZ} = \frac{dP_{XYZ}}{dM_{XYZ}}.$$

Let $h_{XYZ}$, $h_{YZ}$, $h_Y$, etc., denote the corresponding relative entropy densities, e.g.,

$$h_{XYZ} = \ln f_{XYZ}.$$



Define as previously the conditional densities

$$f_{X|YZ} = \frac{f_{XYZ}}{f_{YZ}}; \quad f_{X|Y} = \frac{f_{XY}}{f_Y},$$

the conditional entropy densities

$$h_{X|YZ} = \ln f_{X|YZ}; \quad h_{X|Y} = \ln f_{X|Y},$$

and the conditional relative entropies

$$H_{P\|M}(X|Y) = E(\ln f_{X|Y})$$

and

$$H_{P\|M}(X|Y,Z) = E(\ln f_{X|YZ}).$$

By construction (or by double use of Lemma 5.3.1) we have the following chain rules for conditional relative entropy and its densities.

Lemma 5.3.3:

$$f_{XYZ} = f_{X|YZ} f_{Y|Z} f_Z,$$

$$h_{XYZ} = h_{X|YZ} + h_{Y|Z} + h_Z,$$

and hence

$$H_{P\|M}(X,Y,Z) = H_{P\|M}(X|YZ) + H_{P\|M}(Y|Z) + H_{P\|M}(Z).$$

Corollary 5.3.1: Given a distribution $P_{XY}$, suppose that there is a product distribution $M_{XY} = M_X \times M_Y \gg P_{XY}$. Then

$$M_{XY} \gg P_X \times P_Y \gg P_{XY},$$

$$\frac{dP_{XY}}{d(P_X \times P_Y)} = \frac{f_{XY}}{f_X f_Y} = \frac{f_{X|Y}}{f_X},$$

$$\frac{d(P_X \times P_Y)}{dM_{XY}} = f_X f_Y,$$

$$D(P_{XY}\|P_X \times P_Y) + H_{P\|M}(X) = H_{P\|M}(X|Y),$$

and

$$D(P_X \times P_Y\|M_{XY}) = H_{P\|M}(X) + H_{P\|M}(Y).$$

Proof: First apply Theorem 5.3.1 with $M_{XY} = M_X \times M_Y$. Since $M_{XY}$ is a product measure, $M_{X|Y} = M_X$ and $M_{XY} \gg S_{XY} = M_X \times P_Y \gg P_{XY}$ from the theorem. Next we again apply Theorem 5.3.1, but this time the roles of X and Y in the theorem are reversed and we replace $M_{XY}$ in the theorem statement by the current $S_{XY} = M_X \times P_Y$ and we replace $S_{XY}$ in the theorem statement by

$$S'_{XY}(F \times G) = \int_F S_{Y|X}(G|x)\, dP_X(x) = P_X(F) P_Y(G);$$

that is, $S'_{XY} = P_X \times P_Y$. We then conclude from the theorem that $S'_{XY} = P_X \times P_Y \gg P_{XY}$, proving the first statement. We now have that

$$M_{XY} = M_X \times M_Y \gg P_X \times P_Y \gg P_{XY}$$

and hence the chain rule for Radon-Nikodym derivatives (e.g., Lemma 5.7.3 of [50]) implies that



$$\frac{dP_{XY}}{dM_{XY}} = \frac{dP_{XY}}{d(P_X \times P_Y)} \frac{d(P_X \times P_Y)}{d(M_X \times M_Y)}.$$

It is straightforward to verify directly that

$$\frac{d(P_X \times P_Y)}{d(M_X \times M_Y)} = \frac{dP_X}{dM_X} \frac{dP_Y}{dM_Y} = f_X f_Y$$

and hence

$$f_{XY} = \left( \frac{dP_{XY}}{d(P_X \times P_Y)} \right) f_X f_Y,$$

as claimed. Taking expectations using Lemma 5.2.3 then completes the proof (as in the proof of Corollary 5.2.1). □

The corollary provides an interpretation of the product measure $P_X \times P_Y$. This measure yields independent random variables with the same marginal distributions as $P_{XY}$, which motivates calling $P_X \times P_Y$ the independent approximation or memoryless approximation to $P_{XY}$. The next corollary further enhances this name by showing that $P_X \times P_Y$ is the best such approximation in the sense of yielding the minimum divergence with respect to the original distribution.

Corollary 5.3.2: Given a distribution $P_{XY}$ let $\mathcal{M}$ denote the class of all product distributions for XY; that is, if $M_{XY} \in \mathcal{M}$, then $M_{XY} = M_X \times M_Y$. Then

$$\inf_{M_{XY} \in \mathcal{M}} D(P_{XY}\|M_{XY}) = D(P_{XY}\|P_X \times P_Y).$$

Proof: We need only consider those M yielding finite divergence (since if there are none, both sides of the formula are infinite and the corollary is trivially true). Then

$$D(P_{XY}\|M_{XY}) = D(P_{XY}\|P_X \times P_Y) + D(P_X \times P_Y\|M_{XY}) \ge D(P_{XY}\|P_X \times P_Y)$$

with equality if and only if $D(P_X \times P_Y\|M_{XY}) = 0$, which it will be if $M_{XY} = P_X \times P_Y$. □
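
The decomposition used in the proof can be tried numerically on a finite alphabet. The sketch below is only an illustration under assumed values: the joint pmf `P` is hypothetical, and the product measures are drawn at random with a fixed seed. It checks that $D(P_{XY}\|M_X \times M_Y)$ splits into $D(P_{XY}\|P_X \times P_Y)$ plus $D(P_X \times P_Y\|M_X \times M_Y)$, so that no product measure does better than the independent approximation.

```python
import math
import random

def kl(p, m):
    """Divergence D(p||m) in nats for pmfs given as dicts on the same finite set."""
    return sum(pv * math.log(pv / m[k]) for k, pv in p.items() if pv > 0)

# Hypothetical joint pmf P_XY on {0,1} x {0,1,2}.
P = {(0, 0): 0.20, (0, 1): 0.05, (0, 2): 0.15,
     (1, 0): 0.10, (1, 1): 0.30, (1, 2): 0.20}

P_X, P_Y = {}, {}
for (x, y), v in P.items():
    P_X[x] = P_X.get(x, 0.0) + v
    P_Y[y] = P_Y.get(y, 0.0) + v

# Independent approximation P_X x P_Y.
indep = {(x, y): P_X[x] * P_Y[y] for (x, y) in P}
base = kl(P, indep)

rng = random.Random(0)  # fixed seed so the run is reproducible

def random_pmf(keys):
    """A strictly positive random pmf on the given keys."""
    w = [rng.random() + 0.01 for _ in keys]
    s = sum(w)
    return {k: wi / s for k, wi in zip(keys, w)}

max_gap = 0.0                     # largest violation of the decomposition
min_excess = float("inf")         # smallest value of D(P||M) - D(P||P_X x P_Y)
for _ in range(1000):
    M_X, M_Y = random_pmf([0, 1]), random_pmf([0, 1, 2])
    prod = {(x, y): M_X[x] * M_Y[y] for (x, y) in P}
    total = kl(P, prod)
    max_gap = max(max_gap, abs(total - (base + kl(indep, prod))))
    min_excess = min(min_excess, total - base)

print(base, max_gap, min_excess)
```

Here `max_gap` stays at rounding level and `min_excess` is never negative, in line with the corollary.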

Recall that given random variables (X, Y, Z) with distribution $M_{XYZ}$, then $X \to Y \to Z$ is a Markov chain (with respect to $M_{XYZ}$) if for any event $F \in \mathcal{B}_{A_Z}$ with probability one

$$M_{Z|YX}(F|y,x) = M_{Z|Y}(F|y).$$

If this holds, we also say that X and Z are conditionally independent given Y. Equivalently, if we define the distribution $M_{X \times Z|Y}$ by

$$M_{X \times Z|Y}(F_X \times F_Z \times F_Y) = \int_{F_Y} M_{X|Y}(F_X|y) M_{Z|Y}(F_Z|y)\, dM_Y(y);$$

$$F_X \in \mathcal{B}_{A_X};\ F_Z \in \mathcal{B}_{A_Z};\ F_Y \in \mathcal{B}_{A_Y};$$

then $Z \to Y \to X$ is a Markov chain if $M_{X \times Z|Y} = M_{XYZ}$. (See Section 5.10 of [50].) This construction shows that a Markov chain is symmetric in the sense that $X \to Y \to Z$ if and only if $Z \to Y \to X$.

Note that for any measure $M_{XYZ}$, $X \to Y \to Z$ is a Markov chain under $M_{X \times Z|Y}$ by construction.

The following corollary highlights special properties of the various densities 
and relative entropies when the dominating measure is a Markov chain. It will 
lead to the idea of a Markov approximation to an arbitrary distribution on 
triples extending the independent approximation of the previous corollary. 

Corollary 5.3.3: Given a probability space, suppose that $M_{XYZ} \gg P_{XYZ}$ are two distributions for a random vector (X, Y, Z) with the property that $Z \to Y \to X$ forms a Markov chain under M. Then

$$M_{XYZ} \gg P_{X \times Z|Y} \gg P_{XYZ}$$

and

$$\frac{dP_{XYZ}}{dP_{X \times Z|Y}} = \frac{f_{X|YZ}}{f_{X|Y}} \tag{5.21}$$

$$\frac{dP_{X \times Z|Y}}{dM_{XYZ}} = f_{YZ} f_{X|Y}. \tag{5.22}$$

Thus

$$\ln \frac{dP_{XYZ}}{dP_{X \times Z|Y}} + h_{X|Y} = h_{X|YZ}$$

$$\ln \frac{dP_{X \times Z|Y}}{dM_{XYZ}} = h_{YZ} + h_{X|Y}$$

and taking expectations yields

$$D(P_{XYZ}\|P_{X \times Z|Y}) + H_{P\|M}(X|Y) = H_{P\|M}(X|YZ) \tag{5.23}$$

$$D(P_{X \times Z|Y}\|M_{XYZ}) = D(P_{YZ}\|M_{YZ}) + H_{P\|M}(X|Y). \tag{5.24}$$

Furthermore,

$$P_{X \times Z|Y} = P_{X|Y} P_{YZ}, \tag{5.25}$$

that is,

$$P_{X \times Z|Y}(F_X \times F_Z \times F_Y) = \int_{F_Y \times F_Z} P_{X|Y}(F_X|y)\, dP_{ZY}(z,y). \tag{5.26}$$

Lastly, if $Z \to Y \to X$ is a Markov chain under M, then it is also a Markov chain under P if and only if

$$h_{X|Y} = h_{X|YZ} \tag{5.27}$$

in which case

$$H_{P\|M}(X|Y) = H_{P\|M}(X|YZ). \tag{5.28}$$

Proof: Define

$$g(x,y,z) = \frac{f_{X|YZ}(x|y,z)}{f_{X|Y}(x|y)} = \frac{f_{XYZ}(x,y,z)}{f_{YZ}(y,z)} \frac{f_Y(y)}{f_{XY}(x,y)}$$

and simplify notation by defining the measure $Q = P_{X \times Z|Y}$. Note that $Z \to Y \to X$ is a Markov chain with respect to Q. To prove the first statement of the Corollary requires proving the following relation:

$$P_{XYZ}(F_X \times F_Y \times F_Z) = \int_{F_X \times F_Y \times F_Z} g\, dQ;$$

all $F_X \in \mathcal{B}_{A_X}$, $F_Z \in \mathcal{B}_{A_Z}$, $F_Y \in \mathcal{B}_{A_Y}$.

From iterated expectation with respect to Q (e.g., Section 5.9 of [50])

$$E(g 1_{F_X}(X) 1_{F_Z}(Z) 1_{F_Y}(Y)) = E(1_{F_Y}(Y) 1_{F_Z}(Z) E(g 1_{F_X}(X)|YZ))$$

$$= \int 1_{F_Y}(y) 1_{F_Z}(z) \left( \int_{F_X} g(x,y,z)\, dQ_{X|YZ}(x|y,z) \right) dQ_{YZ}(y,z).$$

Since $Q_{YZ} = P_{YZ}$ and $Q_{X|YZ} = P_{X|Y}$ Q-a.e. by construction, the previous formula implies that

$$\int_{F_X \times F_Y \times F_Z} g\, dQ = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} g\, dP_{X|Y}.$$

This proves (5.25). Since $M_{XYZ} \gg P_{XYZ}$, we also have that $M_{XY} \gg P_{XY}$ and hence application of Theorem 5.3.1 yields

$$\int_{F_X \times F_Y \times F_Z} g\, dQ = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} g f_{X|Y}\, dM_{X|Y} = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} f_{X|YZ}\, dM_{X|Y}.$$

By assumption, however, $M_{X|Y} = M_{X|YZ}$ a.e. and therefore

$$\int_{F_X \times F_Y \times F_Z} g\, dQ = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} f_{X|YZ}\, dM_{X|YZ}$$

$$= \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} dP_{X|YZ} = P_{XYZ}(F_X \times F_Y \times F_Z),$$

where the final step follows from iterated expectation. This proves (5.21) and that $Q \gg P_{XYZ}$.

To prove (5.22) we proceed in a similar manner and replace g by $f_{X|Y} f_{ZY}$ and replace Q by $M_{XYZ} = M_{X \times Z|Y}$. Also abbreviate $P_{X \times Z|Y}$ to $\hat{P}$. As in the proof of (5.21) we have since $Z \to Y \to X$ is a Markov chain under M that

$$\int_{F_X \times F_Y \times F_Z} g\, dQ = \int_{F_Y \times F_Z} dM_{YZ} \int_{F_X} g\, dM_{X|Y}$$

$$= \int_{F_Y \times F_Z} f_{ZY}\, dM_{YZ} \left( \int_{F_X} f_{X|Y}\, dM_{X|Y} \right) = \int_{F_Y \times F_Z} dP_{YZ} \left( \int_{F_X} f_{X|Y}\, dM_{X|Y} \right).$$

From Theorem 5.3.1 this is

$$\int_{F_Y \times F_Z} P_{X|Y}(F_X|y)\, dP_{YZ}.$$

But $P_{YZ} = \hat{P}_{YZ}$ and

$$P_{X|Y}(F_X|y) = \hat{P}_{X|Y}(F_X|y) = \hat{P}_{X|YZ}(F_X|yz)$$

since $\hat{P}$ yields a Markov chain. Thus the previous formula is $\hat{P}(F_X \times F_Y \times F_Z)$, proving (5.22) and the corresponding absolute continuity.

If $Z \to Y \to X$ is a Markov chain under both M and P, then $P_{X \times Z|Y} = P_{XYZ}$ and hence

$$\frac{dP_{XYZ}}{dP_{X \times Z|Y}} = 1 = \frac{f_{X|YZ}}{f_{X|Y}},$$

which implies (5.27). Conversely, if (5.27) holds, then $f_{X|YZ} = f_{X|Y}$ which with (5.21) implies that $P_{XYZ} = P_{X \times Z|Y}$, proving that $Z \to Y \to X$ is a Markov chain under P. □

The previous corollary and one of the constructions used will prove important 
later and hence it is emphasized now with a definition and another corollary 
giving an interesting interpretation. 

Given a distribution $P_{XYZ}$, define the distribution $P_{X \times Z|Y}$ as the Markov approximation to $P_{XYZ}$. Abbreviate $P_{X \times Z|Y}$ to $\hat{P}$. The definition has two motivations. First, the distribution $\hat{P}$ makes $Z \to Y \to X$ a Markov chain which has the same initial distribution $\hat{P}_{ZY} = P_{ZY}$ and the same conditional distribution $\hat{P}_{X|Y} = P_{X|Y}$; the only difference is that $\hat{P}$ yields a Markov chain, that is, $\hat{P}_{X|ZY} = \hat{P}_{X|Y}$. The second motivation is the following corollary which shows that of all Markov distributions, $\hat{P}$ is the closest to P in the sense of minimizing the divergence.

Corollary 5.3.4: Given a distribution $P = P_{XYZ}$, let $\mathcal{M}$ denote the class of all distributions for XYZ for which $Z \to Y \to X$ is a Markov chain under $M_{XYZ}$ ($M_{XYZ} = M_{X \times Z|Y}$). Then

$$\inf_{M_{XYZ} \in \mathcal{M}} D(P_{XYZ}\|M_{XYZ}) = D(P_{XYZ}\|P_{X \times Z|Y});$$

that is, the infimum is a minimum and it is achieved by the Markov approximation.

Proof: If no $M_{XYZ}$ in the constraint set satisfies $M_{XYZ} \gg P_{XYZ}$, then both sides of the above equation are infinite. Hence confine interest to the case $M_{XYZ} \gg P_{XYZ}$. Similarly, if all such $M_{XYZ}$ yield an infinite divergence, we are done. Hence we also consider only $M_{XYZ}$ yielding finite divergence. Then the previous corollary implies that $M_{XYZ} \gg P_{X \times Z|Y} \gg P_{XYZ}$ and hence

$$D(P_{XYZ}\|M_{XYZ}) = D(P_{XYZ}\|P_{X \times Z|Y}) + D(P_{X \times Z|Y}\|M_{XYZ}) \ge D(P_{XYZ}\|P_{X \times Z|Y})$$

with equality if and only if

$$D(P_{X \times Z|Y}\|M_{XYZ}) = D(P_{YZ}\|M_{YZ}) + H_{P\|M}(X|Y) = 0.$$

But this will be zero if M is the Markov approximation to P since then $M_{YZ} = P_{YZ}$ and $M_{X|Y} = P_{X|Y}$ by construction. □
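
For finite alphabets the Markov approximation is simply $\hat{P}(x,y,z) = P_{X|Y}(x|y) P_{YZ}(y,z)$, and the corollary can be probed numerically. The sketch below is an illustration under assumptions: the joint pmf is drawn at random with a fixed seed, and the helper names (`marginal`, `markov_approximation`) are hypothetical. Markov competitors M are produced by applying the same construction to other random pmfs.

```python
import itertools
import math
import random

rng = random.Random(1)  # fixed seed so the run is reproducible
triples = list(itertools.product(range(2), repeat=3))  # points (x, y, z)

def normalize(w):
    s = sum(w.values())
    return {k: v / s for k, v in w.items()}

def kl(p, m):
    """Divergence D(p||m) in nats for pmfs given as dicts on the same finite set."""
    return sum(pv * math.log(pv / m[k]) for k, pv in p.items() if pv > 0)

def marginal(joint, idx):
    """Marginal pmf on the coordinates listed in idx (keys are tuples)."""
    out = {}
    for k, v in joint.items():
        key = tuple(k[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

def markov_approximation(joint):
    """Return P_{X|Y}(x|y) P_YZ(y,z), under which Z -> Y -> X is a Markov chain."""
    p_xy = marginal(joint, (0, 1))
    p_y = marginal(joint, (1,))
    p_yz = marginal(joint, (1, 2))
    return {(x, y, z): p_xy[(x, y)] / p_y[(y,)] * p_yz[(y, z)]
            for (x, y, z) in joint}

P = normalize({t: rng.random() + 0.05 for t in triples})
P_hat = markov_approximation(P)
best = kl(P, P_hat)  # D(P_XYZ || P_{X x Z|Y})

worst_gap = 0.0
smallest_excess = float("inf")
for _ in range(500):
    # A random strictly positive distribution under which Z -> Y -> X is Markov.
    M = markov_approximation(normalize({t: rng.random() + 0.05 for t in triples}))
    total = kl(P, M)
    worst_gap = max(worst_gap, abs(total - (best + kl(P_hat, M))))
    smallest_excess = min(smallest_excess, total - best)

print(best, worst_gap, smallest_excess)
```

In this experiment the decomposition from the proof holds to rounding error and no Markov competitor beats $\hat{P}$.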

Generalized Conditional Relative Entropy 

We now return to the issue of providing a general definition of conditional relative entropy, that is, one which does not require the existence of the densities or, equivalently, the absolute continuity of the underlying measures. We require, however, that the general definition reduce to that considered thus far when the densities exist so that all of the results of this section will remain valid when applicable. The general definition takes advantage of the basic construction of the early part of this section. Once again let $M_{XY}$ and $P_{XY}$ be two measures, where we no longer assume that $M_{XY} \gg P_{XY}$. Define as in Theorem 5.3.1 the modified measure $S_{XY}$ by

$$S_{XY}(F \times G) = \int_G M_{X|Y}(F|y)\, dP_Y(y); \tag{5.29}$$

that is, $S_{XY}$ has the same Y marginal as $P_{XY}$ and the same conditional distribution of X given Y as $M_{XY}$. We now replace the previous definition by the following: The conditional relative entropy is defined by

$$H_{P\|M}(X|Y) = D(P_{XY}\|S_{XY}). \tag{5.30}$$

If $M_{XY} \gg P_{XY}$ as before, then from Theorem 5.3.1 this is the same quantity as the original definition and there is no change. The divergence of (5.30), however, is well-defined even if it is not true that $M_{XY} \gg P_{XY}$ and hence the densities used in the original definition do not work. The key question is whether or not the chain rule

$$H_{P\|M}(Y) + H_{P\|M}(X|Y) = H_{P\|M}(XY) \tag{5.31}$$

remains valid in the more general setting. It has already been proven in the case that $M_{XY} \gg P_{XY}$, hence suppose this does not hold. In this case, if it is also true that $M_Y \gg P_Y$ does not hold, then both the marginal and joint relative entropies will be infinite and (5.31) again must hold since the conditional relative entropy is nonnegative. Thus we need only show that the formula holds for the case where $M_Y \gg P_Y$ but it is not true that $M_{XY} \gg P_{XY}$. By assumption there must be an event F for which

$$M_{XY}(F) = \int M_{X|Y}(F_y)\, dM_Y(y) = 0$$

but

$$P_{XY}(F) = \int P_{X|Y}(F_y)\, dP_Y(y) \ne 0,$$

where $F_y = \{x : (x,y) \in F\}$ is the section of F at y. Thus $M_{X|Y}(F_y) = 0$ $M_Y$-a.e. and hence also $P_Y$-a.e. since $M_Y \gg P_Y$. Thus

$$S_{XY}(F) = \int M_{X|Y}(F_y)\, dP_Y(y) = 0$$

and hence it is not true that $S_{XY} \gg P_{XY}$ and therefore

$$D(P_{XY}\|S_{XY}) = \infty,$$

which proves that the chain rule holds in the general case.

It can happen that $P_{XY}$ is not absolutely continuous with respect to $M_{XY}$, and yet $D(P_{XY}\|S_{XY}) < \infty$ and hence $P_{XY} \ll S_{XY}$ and hence

$$H_{P\|M}(X|Y) = \int dP_{XY} \ln \frac{dP_{XY}}{dS_{XY}},$$

in which case it makes sense to define the conditional density

$$f_{X|Y} = \frac{dP_{XY}}{dS_{XY}}$$

so that exactly as in the original tentative definition in terms of densities (5.17) we have that

$$H_{P\|M}(X|Y) = \int dP_{XY} \ln f_{X|Y}.$$

Note that this allows us to define a meaningful conditional density even though the joint density $f_{XY}$ does not exist! If the joint density does exist, then the conditional density reduces to the previous definition from Theorem 5.3.1.

We summarize the generalization in the following theorem. 

Theorem 5.3.2: The conditional relative entropy defined by (5.30) and (5.29) agrees with the definition (5.17) in terms of densities and satisfies the chain rule (5.31). If the conditional relative entropy is finite, then

$$H_{P\|M}(X|Y) = \int dP_{XY} \ln f_{X|Y},$$

where the conditional density is defined by

$$f_{X|Y} = \frac{dP_{XY}}{dS_{XY}}.$$

If $M_{XY} \gg P_{XY}$, then this reduces to the usual definition

$$f_{X|Y} = \frac{f_{XY}}{f_Y}.$$

The generalizations can be extended to three or more random variables in the obvious manner.



5.4 Limiting Entropy Densities 

We now combine several of the results of the previous section to obtain results 
characterizing the limits of certain relative entropy densities. 

Lemma 5.4.1: Given a probability space $(\Omega, \mathcal{B})$ and an asymptotically generating sequence of sub-σ-fields $\mathcal{F}_n$ and two measures $M \gg P$, let $P_n = P_{\mathcal{F}_n}$, $M_n = M_{\mathcal{F}_n}$ and let $h_n = \ln dP_n/dM_n$ and $h = \ln dP/dM$ denote the entropy densities. If $D(P\|M) < \infty$, then

$$\lim_{n \to \infty} \int |h_n - h|\, dP = 0,$$

that is, $h_n \to h$ in $L^1$. Thus the entropy densities $h_n$ are uniformly integrable.

Proof: Follows from Corollaries 5.2.3 and 5.2.6. □

The following lemma is Lemma 1 of Algoet and Cover [7].

Lemma 5.4.2: Given a sequence of nonnegative random variables $\{f_n\}$ defined on a probability space $(\Omega, \mathcal{B}, P)$, suppose that

$$E(f_n) \le 1; \quad \text{all } n.$$

Then

$$\limsup_{n \to \infty} \frac{1}{n} \ln f_n \le 0.$$

Proof: Given any $\epsilon > 0$ the Markov inequality and the given assumption imply that

$$P(f_n \ge e^{n\epsilon}) \le \frac{E(f_n)}{e^{n\epsilon}} \le e^{-n\epsilon}.$$

We therefore have that

$$P\left(\frac{1}{n} \ln f_n \ge \epsilon\right) \le e^{-n\epsilon}$$

and therefore

$$\sum_{n=1}^{\infty} P\left(\frac{1}{n} \ln f_n \ge \epsilon\right) \le \sum_{n=1}^{\infty} e^{-n\epsilon} < \infty.$$

Thus from the Borel-Cantelli lemma (Lemma 4.6.3 of [50]), $P(n^{-1} \ln f_n \ge \epsilon \text{ i.o.}) = 0$. Since $\epsilon$ is arbitrary, the lemma is proved. □
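
The Markov-inequality step can be examined exactly in a simple special case. The sketch below assumes an i.i.d. binary model chosen purely for illustration: under the reference measure M the bits are Bernoulli(q), under P they are Bernoulli(p), and $f_n$ is the likelihood ratio of the first n bits, so that $E_M f_n = 1$. The parameters `p`, `q`, and `eps` are hypothetical.

```python
import math

p, q = 0.7, 0.4   # hypothetical Bernoulli parameters for P and M
eps = 0.05        # hypothetical threshold

def tail_under_M(n):
    """Exact M(f_n >= e^{n eps}) where f_n = dP_n/dM_n for n i.i.d. bits."""
    total = 0.0
    for k in range(n + 1):  # k = number of ones among the n bits
        log_f = k * math.log(p / q) + (n - k) * math.log((1 - p) / (1 - q))
        if log_f >= n * eps:
            total += math.comb(n, k) * q ** k * (1 - q) ** (n - k)
    return total

# Each entry is (n, exact tail probability, Markov bound e^{-n eps}).
results = [(n, tail_under_M(n), math.exp(-n * eps)) for n in (10, 50, 100, 200)]
for n, tail, bound in results:
    print(n, tail, bound)
```

In every row the exact tail probability sits below the bound $e^{-n\epsilon}$, and the bound is summable in n, which is what the Borel-Cantelli argument requires.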

The lemma easily gives the first half of the following result, which is also 
due to Algoet and Cover [7], but the proof is different here and does not use 
martingale theory. The result is the generalization of Lemma 2.7.1. 

Theorem 5.4.1: Given a probability space $(\Omega, \mathcal{B})$ and an asymptotically generating sequence of sub-σ-fields $\mathcal{F}_n$, let M and P be two probability measures with their restrictions $M_n = M_{\mathcal{F}_n}$ and $P_n = P_{\mathcal{F}_n}$. Suppose that $M_n \gg P_n$ for all n and define $f_n = dP_n/dM_n$ and $h_n = \ln f_n$. Then

$$\limsup_{n \to \infty} \frac{1}{n} h_n \le 0, \quad M\text{-a.e.}$$

and

$$\liminf_{n \to \infty} \frac{1}{n} h_n \ge 0, \quad P\text{-a.e.}$$

If it is also true that $M \gg P$ (e.g., $D(P\|M) < \infty$), then

$$\lim_{n \to \infty} \frac{1}{n} h_n = 0, \quad P\text{-a.e.}$$

Proof: Since

$$E_M f_n = E_{M_n} f_n = 1,$$

the first statement follows from the previous lemma. To prove the second statement consider the probability

$$P\left(-\frac{1}{n} \ln \frac{dP_n}{dM_n} > \epsilon\right) = P_n\left(-\frac{1}{n} \ln f_n > \epsilon\right) = P_n(f_n < e^{-n\epsilon})$$

$$= \int_{f_n < e^{-n\epsilon}} dP_n = \int_{f_n < e^{-n\epsilon}} f_n\, dM_n$$

$$\le e^{-n\epsilon} \int_{f_n < e^{-n\epsilon}} dM_n = e^{-n\epsilon} M_n(f_n < e^{-n\epsilon}) \le e^{-n\epsilon}.$$

Thus it has been shown that

$$P\left(\frac{1}{n} h_n < -\epsilon\right) \le e^{-n\epsilon}$$

and hence again applying the Borel-Cantelli lemma we have that

$$P(n^{-1} h_n < -\epsilon \text{ i.o.}) = 0$$

which proves the second claim of the theorem.

If $M \gg P$, then the first result also holds P-a.e., which with the second result proves the final claim. □

Barron [9] provides an additional property of the sequence $h_n/n$. If $M \gg P$, then the sequence $h_n/n$ is dominated by an integrable function.



5.5 Information for General Alphabets 

We can now use the divergence results of the previous sections to generalize the definitions of information and to develop their basic properties. We assume now that all random variables and processes are defined on a common underlying probability space $(\Omega, \mathcal{B}, P)$. As we have seen how all of the various information quantities (entropy, mutual information, conditional mutual information) can be expressed in terms of divergence in the finite case, we immediately have definitions for the general case. Given two random variables X and Y, define the average mutual information between them by

$$I(X;Y) = D(P_{XY}\|P_X \times P_Y), \tag{5.32}$$

where $P_{XY}$ is the joint distribution of the random variables X and Y and $P_X \times P_Y$ is the product distribution.

Define the entropy of a single random variable X by

$$H(X) = I(X;X). \tag{5.33}$$

From the definition of divergence this implies that

$$I(X;Y) = \sup_{\mathcal{Q}} H_{P_{XY}\|P_X \times P_Y}(\mathcal{Q}).$$

From Dobrushin's theorem (Lemma 5.2.2), the supremum can be taken over partitions whose elements are contained in a generating field. Letting the generating field be the field of all rectangles of the form $F \times G$, $F \in \mathcal{B}_{A_X}$ and $G \in \mathcal{B}_{A_Y}$, we have the following lemma which is often used as a definition for mutual information.

Lemma 5.5.1:

$$I(X;Y) = \sup_{q,r} I(q(X); r(Y)),$$

where the supremum is over all quantizers q and r of $A_X$ and $A_Y$. Hence there exist sequences of increasingly fine quantizers $q_n : A_X \to A_n$ and $r_n : A_Y \to B_n$ such that

$$I(X;Y) = \lim_{n \to \infty} I(q_n(X); r_n(Y)).$$

Applying this result to entropy we have that

$$H(X) = \sup_q H(q(X)),$$

where the supremum is over all quantizers.

By "increasingly fine" quantizers is meant that the corresponding partitions $\mathcal{Q}_n = \{q_n^{-1}(a);\ a \in A_n\}$ are successive refinements, e.g., atoms in $\mathcal{Q}_n$ are unions of atoms in $\mathcal{Q}_{n+1}$. (If this were not so, a new quantizer could be defined for which it was true.) There is an important drawback to the lemma (which will shortly be removed in Lemma 5.5.5 for the special case where the alphabets are standard): the quantizers which approach the suprema may depend on the underlying measure $P_{XY}$. In particular, a sequence of quantizers which works for one measure need not work for another.

Given a third random variable Z, let $A_X$, $A_Y$, and $A_Z$ denote the alphabets of X, Y, and Z and define the conditional average mutual information

$$I(X;Y|Z) = D(P_{XYZ}\|P_{X \times Y|Z}). \tag{5.34}$$

This is the extension of the discrete alphabet definition of (2.25) and it makes sense only if the distribution $P_{X \times Y|Z}$ exists, which is the case if the alphabets are standard but may not be the case otherwise. We shall later provide an alternative definition due to Wyner [152] that is valid more generally and equal to the above when the spaces are standard.

Note that $I(X;Y|Z)$ can be interpreted using Corollary 5.3.4 as the divergence between $P_{XYZ}$ and its Markov approximation.

Combining these definitions with Lemma 5.2.1 yields the following general- 
izations of the discrete alphabet results. 

Lemma 5.5.2: Given two random variables X and Y, then

$$I(X;Y) \ge 0$$

with equality if and only if X and Y are independent. Given three random variables X, Y, and Z, then

$$I(X;Y|Z) \ge 0$$

with equality if and only if $Y \to Z \to X$ form a Markov chain.

Proof: The first statement follows from Lemma 5.2.1 since X and Y are independent if and only if $P_{XY} = P_X \times P_Y$. The second statement follows from (5.34) and the fact that $Y \to Z \to X$ is a Markov chain if and only if $P_{XYZ} = P_{X \times Y|Z}$ (see, e.g., Corollary 5.10.1 of [50]). □

The properties of divergence provide means of computing and approximating these information measures. From Lemma 5.2.3, if $I(X;Y)$ is finite, then

$$I(X;Y) = \int \ln \frac{dP_{XY}}{d(P_X \times P_Y)}\, dP_{XY} \tag{5.35}$$

and if $I(X;Y|Z)$ is finite, then

$$I(X;Y|Z) = \int \ln \frac{dP_{XYZ}}{dP_{X \times Y|Z}}\, dP_{XYZ}. \tag{5.36}$$

For example, if X, Y are two random variables whose distribution is absolutely continuous with respect to Lebesgue measure $dx\,dy$ and hence which have a pdf $f_{XY}(x,y) = dP_{XY}(x,y)/dx\,dy$, then

$$I(X;Y) = \int dx\,dy\, f_{XY}(x,y) \ln \frac{f_{XY}(x,y)}{f_X(x) f_Y(y)},$$

where $f_X$ and $f_Y$ are the marginal pdf's, e.g.,

$$f_X(x) = \int f_{XY}(x,y)\, dy = \frac{dP_X(x)}{dx}.$$

In the cases where these densities exist, we define the information densities

$$i_{X;Y} = \ln \frac{dP_{XY}}{d(P_X \times P_Y)} \tag{5.37}$$

$$i_{X;Y|Z} = \ln \frac{dP_{XYZ}}{dP_{X \times Y|Z}}.$$



The results of Section 5.3 can be used to provide conditions under which the 
various information densities exist and to relate them to each other. Corollaries 
5.3.1 and 5.3.2 combined with the definition of mutual information immediately 
yield the following two results. 

Lemma 5.5.3: Let X and Y be standard alphabet random variables with distribution $P_{XY}$. Suppose that there exists a product distribution $M_{XY} = M_X \times M_Y$ such that $M_{XY} \gg P_{XY}$. Then

$$M_{XY} \gg P_X \times P_Y \gg P_{XY},$$

$$i_{X;Y} = \ln(f_{XY}/f_X f_Y) = \ln(f_{X|Y}/f_X)$$

and

$$I(X;Y) + H_{P\|M}(X) = H_{P\|M}(X|Y). \tag{5.38}$$

Comment: This generalizes the fact that $I(X;Y) = H(X) - H(X|Y)$ for the finite alphabet case. The sign reversal results from the difference in definitions of relative entropy and entropy. Note that this implies that unlike ordinary entropy, relative entropy is increased by conditioning, at least when the reference measure is a product measure.

The previous lemma provides an apparently more general test for the existence of a mutual information density than the requirement that $P_X \times P_Y \gg P_{XY}$; it states that if $P_{XY}$ is dominated by any product measure, then it is also dominated by the product of its own marginals and hence the densities exist. The generality is only apparent, however, as the given condition implies from Corollary 5.3.1 that the distribution is dominated by its independent approximation. Restating Corollary 5.3.1 in terms of mutual information yields the following.

Corollary 5.5.1: Given a distribution $P_{XY}$ let $\mathcal{M}$ denote the collection of all product distributions $M_{XY} = M_X \times M_Y$. Then

$$I(X;Y) = \inf_{M_{XY} \in \mathcal{M}} H_{P\|M}(X|Y) = \inf_{M_{XY} \in \mathcal{M}} D(P_{XY}\|M_{XY}).$$

The next result is an extension of Lemma 5.5.3 to conditional information 
densities and relative entropy densities when three random variables are con- 
sidered. It follows immediately from Corollary 5.3.3 and the definition of con- 
ditional information density. 

Lemma 5.5.4: (The chain rule for relative entropy densities) Suppose that $M_{XYZ} \gg P_{XYZ}$ are two distributions for three standard alphabet random variables and that $Z \to Y \to X$ is a Markov chain under $M_{XYZ}$. Let $f_{X|YZ}$, $f_{X|Y}$, $h_{X|YZ}$, and $h_{X|Y}$ be as in Section 5.3. Then $P_{X \times Z|Y} \gg P_{XYZ}$,

$$h_{X|YZ} = i_{X;Z|Y} + h_{X|Y} \tag{5.39}$$

and

$$H_{P\|M}(X|Y,Z) = I(X;Z|Y) + H_{P\|M}(X|Y). \tag{5.40}$$

Thus, for example,

$$H_{P\|M}(X|Y,Z) \ge H_{P\|M}(X|Y).$$

As with Corollary 5.5.1, the lemma implies a variational description of con- 
ditional mutual information. The result is just a restatement of Corollary 5.3.4. 

Corollary 5.5.2: Given a distribution $P_{XYZ}$ let $\mathcal{M}$ denote the class of all distributions for XYZ under which $Z \to Y \to X$ is a Markov chain, then

$$I(X;Z|Y) = \inf_{M_{XYZ} \in \mathcal{M}} H_{P\|M}(X|Y,Z) = \inf_{M_{XYZ} \in \mathcal{M}} D(P_{XYZ}\|M_{XYZ}),$$

and the minimum is achieved by $M_{XYZ} = P_{X \times Z|Y}$.

The following Corollary relates the information densities of the various information measures and extends Kolmogorov's equality to standard alphabets.

Corollary 5.5.3: (The chain rule for information densities and Kolmogorov's formula.) Suppose that X, Y, and Z are random variables with standard alphabets and distribution $P_{XYZ}$. Suppose also that there exists a distribution $M_{XYZ} = M_X \times M_{YZ}$ such that $M_{XYZ} \gg P_{XYZ}$. (This is true, for example, if $P_X \times P_{YZ} \gg P_{XYZ}$.) Then the information densities $i_{X;Z|Y}$, $i_{X;Y}$, and $i_{X;(Y,Z)}$ exist and are related by

$$i_{X;Z|Y} + i_{X;Y} = i_{X;(Y,Z)} \tag{5.41}$$

and

$$I(X;Z|Y) + I(X;Y) = I(X;(Y,Z)). \tag{5.42}$$

Proof: If $M_{XYZ} = M_X \times M_{YZ}$, then $Z \to Y \to X$ is trivially a Markov chain since $M_{X|YZ} = M_{X|Y} = M_X$. Thus the previous lemma can be applied to this $M_{XYZ}$ to conclude that $P_{X \times Z|Y} \gg P_{XYZ}$ and that (5.39) holds. We also have that $M_{XY} = M_X \times M_Y \gg P_{XY}$. Thus all of the densities exist. Applying Lemma 5.5.3 to the product measures $M_{XY} = M_X \times M_Y$ and $M_{X(YZ)} = M_X \times M_{YZ}$ in (5.39) yields

$$i_{X;Z|Y} = h_{X|YZ} - h_{X|Y} = \ln f_{X|YZ} - \ln f_{X|Y}$$

$$= \ln \frac{f_{X|YZ}}{f_X} - \ln \frac{f_{X|Y}}{f_X} = i_{X;(Y,Z)} - i_{X;Y}.$$

Taking expectations completes the proof. □
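
For finite alphabets, where all of the densities exist automatically, Kolmogorov's formula (5.42) can be confirmed by direct computation. The following sketch assumes a randomly drawn strictly positive joint pmf (fixed seed); the helper `marginal` and the variable names are hypothetical and used only for this illustration.

```python
import itertools
import math
import random

rng = random.Random(2)  # fixed seed so the run is reproducible
points = list(itertools.product(range(2), range(3), range(2)))  # (x, y, z)
w = {t: rng.random() + 0.05 for t in points}
total = sum(w.values())
P = {t: v / total for t, v in w.items()}  # hypothetical joint pmf P_XYZ

def marginal(joint, idx):
    """Marginal pmf on the coordinates listed in idx (keys are tuples)."""
    out = {}
    for k, v in joint.items():
        key = tuple(k[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

P_X, P_Y = marginal(P, (0,)), marginal(P, (1,))
P_XY, P_YZ = marginal(P, (0, 1)), marginal(P, (1, 2))

# I(X;Y) = D(P_XY || P_X x P_Y)
I_XY = sum(v * math.log(v / (P_X[(x,)] * P_Y[(y,)]))
           for (x, y), v in P_XY.items())

# I(X;(Y,Z)) = D(P_XYZ || P_X x P_YZ)
I_X_YZ = sum(v * math.log(v / (P_X[(x,)] * P_YZ[(y, z)]))
             for (x, y, z), v in P.items())

# I(X;Z|Y) = D(P_XYZ || P_{X x Z|Y}), with
# P_{X x Z|Y}(x, y, z) = P_XY(x, y) P_YZ(y, z) / P_Y(y).
I_XZ_given_Y = sum(v * math.log(v * P_Y[(y,)] / (P_XY[(x, y)] * P_YZ[(y, z)]))
                   for (x, y, z), v in P.items())

print(I_XZ_given_Y + I_XY, I_X_YZ)
```

The two printed values coincide up to rounding, as (5.42) asserts.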

The previous Corollary implies that if $P_X \times P_{YZ} \gg P_{XYZ}$, then also $P_{X \times Z|Y} \gg P_{XYZ}$ and $P_X \times P_Y \gg P_{XY}$ and hence that the existence of $i_{X;(Y,Z)}$ implies that of $i_{X;Z|Y}$ and $i_{X;Y}$. The following result provides a converse to this fact: the existence of the latter two densities implies that of the first. The result is due to Dobrushin [32]. (See also Theorem 3.6.1 of Pinsker [125] and the translator's comments.)

Corollary 5.5.4: If $P_{X \times Z|Y} \gg P_{XYZ}$ and $P_X \times P_Y \gg P_{XY}$, then also $P_X \times P_{YZ} \gg P_{XYZ}$ and



$$\frac{dP_{XYZ}}{d(P_X \times P_{YZ})} = \frac{dP_{XYZ}}{dP_{X \times Z|Y}} \frac{dP_{XY}}{d(P_X \times P_Y)}.$$



Thus the conclusions of Corollary 5.5.3 hold. 

Proof: The key to the proof is the demonstration that

$$\frac{dP_{XY}}{d(P_X \times P_Y)} = \frac{dP_{X \times Z|Y}}{d(P_X \times P_{YZ})}, \tag{5.43}$$

which implies that $P_X \times P_{YZ} \gg P_{X \times Z|Y}$. Since it is assumed that $P_{X \times Z|Y} \gg P_{XYZ}$, the result then follows from the chain rule for Radon-Nikodym derivatives.

Eq. (5.43) will be proved if it is shown that for all $F_X \in \mathcal{B}_{A_X}$, $F_Y \in \mathcal{B}_{A_Y}$, and $F_Z \in \mathcal{B}_{A_Z}$,

$$P_{X \times Z|Y}(F_X \times F_Z \times F_Y) = \int_{F_X \times F_Z \times F_Y} \frac{dP_{XY}}{d(P_X \times P_Y)}\, d(P_X \times P_{YZ}). \tag{5.44}$$

The thrust of the proof is the demonstration that for any measurable nonnegative function $f(x,y)$

$$\int_{z \in F_Z} f(x,y)\, d(P_X \times P_{YZ})(x,y,z) = \int f(x,y) P_{Z|Y}(F_Z|y)\, d(P_X \times P_Y)(x,y). \tag{5.45}$$

The corollary will then follow by substituting

$$f(x,y) = \frac{dP_{XY}}{d(P_X \times P_Y)}(x,y)\, 1_{F_X}(x) 1_{F_Y}(y)$$

into (5.45) to obtain (5.44).

To prove (5.45) first consider indicator functions of rectangles: $f(x,y) = 1_{F_X}(x) 1_{F_Y}(y)$. Then both sides of (5.45) equal $P_X(F_X) P_{YZ}(F_Y \times F_Z)$ from the definitions of conditional probability and product measures. In particular, from Lemma 5.10.1 of [50] the left-hand side is

$$\int_{z \in F_Z} 1_{F_X}(x) 1_{F_Y}(y)\, d(P_X \times P_{YZ})(x,y,z) = \left( \int 1_{F_X}\, dP_X \right) \left( \int 1_{F_Y \times F_Z}\, dP_{YZ} \right) = P_X(F_X) P_{YZ}(F_Y \times F_Z)$$

and the right-hand side is

$$\int 1_{F_X}(x) 1_{F_Y}(y) P_{Z|Y}(F_Z|y)\, d(P_X \times P_Y)(x,y)$$

$$= \left( \int 1_{F_X}(x)\, dP_X(x) \right) \left( \int 1_{F_Y}(y) P_{Z|Y}(F_Z|y)\, dP_Y(y) \right) = P_X(F_X) P_{YZ}(F_Y \times F_Z),$$

as claimed. This implies (5.45) holds also for simple functions and hence also for positive functions by the usual approximation arguments. □

Note that Kolmogorov's formula (5.42) gives a formula for computing conditional mutual information as

$$I(X;Z|Y) = I(X;(Y,Z)) - I(X;Y).$$

The formula is only useful if it is not indeterminate, that is, not of the form $\infty - \infty$. This will be the case if $I(X;Y)$ (the smaller of the two mutual informations) is finite.

Corollary 5.2.5 provides a means of approximating mutual information by that of finite alphabet random variables. Assume now that the random variables X, Y have standard alphabets. For, say, random variable X with alphabet $A_X$ there must then be an asymptotically generating sequence of finite fields $\mathcal{F}_X(n)$ with atoms $\mathcal{A}_X(n)$, that is, all of the members of $\mathcal{F}_X(n)$ can be written as unions of disjoint sets in $\mathcal{A}_X(n)$ and $\mathcal{F}_X(n) \uparrow \mathcal{B}_{A_X}$; that is, $\mathcal{B}_{A_X} = \sigma(\bigcup_n \mathcal{F}_X(n))$. The atoms $\mathcal{A}_X(n)$ form a partition of the alphabet of X.

Consider the divergence result of Corollary 5.2.5 with $P = P_{XY}$, $M = P_X \times P_Y$ and quantizer $q^{(n)}(x,y) = (q_X^{(n)}(x), q_Y^{(n)}(y))$. Consider the limit $n \to \infty$. Since $\mathcal{F}_X(n)$ asymptotically generates $\mathcal{B}_{A_X}$ and $\mathcal{F}_Y(n)$ asymptotically generates $\mathcal{B}_{A_Y}$ and since the pair σ-field $\mathcal{B}_{A_X \times A_Y}$ is generated by rectangles, the field generated by all sets of the form $F_X \times F_Y$ with $F_X \in \mathcal{F}_X(n)$, some n, and $F_Y \in \mathcal{F}_Y(m)$, some m, generates $\mathcal{B}_{A_X \times A_Y}$. Hence Corollary 5.2.5 yields the first result of the following lemma. The second is a special case of the first. The result shows that the quantizers of Lemma 5.5.1 can be chosen in a manner not depending on the underlying measure if the alphabets are standard.

Lemma 5.5.5: Suppose that $X$ and $Y$ are random variables with standard
alphabets defined on a common probability space. Suppose that $q_X^{(n)}$, $n = 1, 2, \cdots$,
is a sequence of quantizers for $A_X$ such that the corresponding partitions
asymptotically generate $\mathcal{B}_{A_X}$. Define quantizers $q_Y^{(n)}$ for $Y$ similarly. Then for any
distribution $P_{XY}$

$$I(X;Y) = \lim_{n \to \infty} I(q_X^{(n)}(X); q_Y^{(n)}(Y))$$

and

$$H(X) = \lim_{n \to \infty} H(q_X^{(n)}(X));$$

that is, the same quantizer sequence works for all distributions.
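
The behavior described by Lemma 5.5.5 can be illustrated numerically. The following is only a sketch under stated assumptions: the "continuous" pair is replaced by a hypothetical fine-grid stand-in (X uniform on 256 cells, Y = X + U mod 1 with U uniform on 64 cells), and the quantizers are dyadic, keeping the top n bits of each coordinate so that each partition refines the previous one. None of these choices come from the text.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint pmf stored as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def quantize(joint, q, r):
    """Joint pmf of (q(X), r(Y)) induced by the joint pmf of (X, Y)."""
    out = {}
    for (a, b), p in joint.items():
        key = (q(a), r(b))
        out[key] = out.get(key, 0.0) + p
    return out

# Assumed fine-grid stand-in for a continuous pair: X uniform on N cells,
# Y = X + U (mod N) with U uniform on W cells and independent of X.
N, W = 256, 64
joint = {(x, (x + u) % N): 1.0 / (N * W) for x in range(N) for u in range(W)}

# Dyadic quantizers: level n keeps the top n of 8 bits, so partitions are nested.
levels = range(1, 9)
infos = [mutual_information(quantize(joint,
                                     lambda a, n=n: a >> (8 - n),
                                     lambda b, n=n: b >> (8 - n)))
         for n in levels]

for n, value in zip(levels, infos):
    print(n, round(value, 4))
```

In this toy model the quantized informations are nondecreasing in n and reach the information of the fine-grid pair, ln(N/W), at the finest level. The same quantizer sequence would be used unchanged for any other joint distribution on the grid, which is the point of the lemma.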

An immediate application of the lemma is the extension of the convexity
properties of Lemma 2.5.4 to standard alphabets.

Corollary 5.5.5: Let $\mu$ denote a distribution on a space $(A_X, \mathcal{B}_{A_X})$, and
let $\nu$ be a regular conditional distribution $\nu(F|x) = \Pr(Y \in F | X = x)$, $x \in A_X$,
$F \in \mathcal{B}_{A_Y}$. Let $\mu\nu$ denote the resulting joint distribution. Let $I_{\mu\nu} = I_{\mu\nu}(X;Y)$
be the average mutual information. Then $I_{\mu\nu}$ is a convex $\cup$ function of $\nu$ and
a convex $\cap$ function of $\mu$.

Proof: Follows immediately from Lemma 5.5.5 and the finite alphabet result
Lemma 2.5.4. □

Next consider the mutual information $I(f(X); g(Y))$ for arbitrary measurable
mappings $f$ and $g$ of $X$ and $Y$. From Lemma 5.5.2 applied to the random
variables $f(X)$ and $g(Y)$, this mutual information can be approximated
arbitrarily closely by $I(q_1(f(X)); q_2(g(Y)))$ by an appropriate choice of quantizers
$q_1$ and $q_2$. Since the composition of $q_1$ and $f$ constitutes a finite quantization
of $X$ and similarly $q_2 g$ is a quantizer for $Y$, we must have that

$$I(f(X); g(Y)) \approx I(q_1(f(X)); q_2(g(Y))) \le I(X;Y).$$

Making this precise yields the following corollary.

Corollary 5.5.6: If $f$ is a measurable function of $X$ and $g$ is a measurable
function of $Y$, then

$$I(f(X); g(Y)) \le I(X;Y).$$

The corollary states that mutual information is not increased by any measurable
mapping, whether finite or not. For practice we point out another proof of
this basic result that directly applies a property of divergence. Let $P = P_{XY}$,
$M = P_X \times P_Y$, and define the mapping $r(x, y) = (f(x), g(y))$. Then from
Corollary 5.2.2 we have

$$I(X;Y) = D(P \| M) \ge D(P_r \| M_r) \ge D(P_{f(X),g(Y)} \| M_{f(X),g(Y)}).$$

But $M_{f(X),g(Y)} = P_{f(X)} \times P_{g(Y)}$ since

$$M_{f(X),g(Y)}(F_X \times F_Y) = M(f^{-1}(F_X) \times g^{-1}(F_Y)) = P_X(f^{-1}(F_X)) \times P_Y(g^{-1}(F_Y)) = P_{f(X)}(F_X) \times P_{g(Y)}(F_Y).$$

Thus the previous inequality yields the corollary. □
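
As a quick sanity check of Corollary 5.5.6 in the finite alphabet case, the sketch below compares I(f(X);g(Y)) with I(X;Y). The joint pmf and the maps f and g are hypothetical choices made only for illustration; any other pmf and any other maps should satisfy the same inequality.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint pmf stored as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def push_forward(joint, f, g):
    """Joint pmf of (f(X), g(Y)) induced by the joint pmf of (X, Y)."""
    out = {}
    for (a, b), p in joint.items():
        key = (f(a), g(b))
        out[key] = out.get(key, 0.0) + p
    return out

# Hypothetical joint pmf on {0,...,5} x {0,...,5}; the weights are arbitrary.
weights = {(x, y): 1 + (x * y + 2 * x + y) % 7 for x in range(6) for y in range(6)}
total = sum(weights.values())
joint = {xy: w / total for xy, w in weights.items()}

def f(x):
    return x % 3        # arbitrary map of X, not a nested quantizer

def g(y):
    return min(y, 2)    # arbitrary map of Y

i_xy = mutual_information(joint)
i_fg = mutual_information(push_forward(joint, f, g))
print(i_fg, "<=", i_xy)
```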

For the remainder of this section we focus on conditional entropy and infor- 
mation. 

Although we cannot express mutual information as a difference of ordinary 
entropies in the general case (since the entropies of nondiscrete random variables 
are generally infinite), we can obtain such a representation in the case where one 
of the two variables is discrete. Suppose we are given a joint distribution $P_{XY}$
and that $X$ is discrete. We can choose a version of the conditional probability
given $Y$ so that $p_{X|Y}(x|y) = P(X = x | Y = y)$ is a valid pmf (considered as a
function of $x$ for fixed $y$) with $P_Y$ probability 1. (This follows from Corollary
5.8.1 of [50] since the alphabet of $X$ is discrete; the alphabet of $Y$ need not
even be standard.) Define

$$H(X|Y = y) = \sum_x p_{X|Y}(x|y) \ln \frac{1}{p_{X|Y}(x|y)}$$

and

$$H(X|Y) = \int H(X|Y = y) \, dP_Y(y).$$

Note that this agrees with the formula of Section 2.5 in the case that both 
alphabets are finite. The following result is due to Wyner [152]. 

Lemma 5.5.6: If $X, Y$ are random variables and $X$ has a finite alphabet,
then

$$I(X;Y) = H(X) - H(X|Y).$$

Proof: We first claim that $p_{X|Y}(x|y)/p_X(x)$ is a version of $dP_{XY}/d(P_X \times P_Y)$.
To see this observe that for $F \in \mathcal{B}(A_X \times A_Y)$, letting $F_y$ denote the section
$\{x : (x, y) \in F\}$, we have that

$$\int_F \frac{p_{X|Y}(x|y)}{p_X(x)} \, d(P_X \times P_Y) = \int \sum_{x \in F_y} \frac{p_{X|Y}(x|y)}{p_X(x)} \, p_X(x) \, dP_Y(y)$$

$$= \int dP_Y(y) \sum_{x \in F_y} p_{X|Y}(x|y) = \int dP_Y(y) \, P_X(F_y|y) = P_{XY}(F).$$

Thus

$$I(X;Y) = \int \ln \Big( \frac{p_{X|Y}(x|y)}{p_X(x)} \Big) \, dP_{XY} = H(X) + \int dP_Y(y) \sum_x p_{X|Y}(x|y) \ln p_{X|Y}(x|y). \quad □$$
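
Lemma 5.5.6 can be checked on a small example in which X is discrete and Y is continuous. The model below is an assumption made purely for illustration: X is a fair bit and Y = X + U with U uniform on [0, 1.5) and independent of X. The integrals are evaluated with a midpoint rule whose cells do not straddle the breakpoints of the (piecewise constant) densities.

```python
import math

# Assumed toy model: X is a fair bit, Y = X + U, U uniform on [0, WIDTH).
P_X = {0: 0.5, 1: 0.5}
WIDTH = 1.5

def density_y_given_x(y, x):
    return 1.0 / WIDTH if x <= y < x + WIDTH else 0.0

def density_y(y):
    return sum(P_X[x] * density_y_given_x(y, x) for x in P_X)

def posterior(x, y):
    """p_{X|Y}(x|y) by Bayes' rule; only used where the density of Y is positive."""
    return P_X[x] * density_y_given_x(y, x) / density_y(y)

def entropy(probabilities):
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# Midpoint rule on [0, 2.5): the step divides 0.5, so no cell crosses 1, 1.5 or 2.5.
STEPS = 2500
dy = 2.5 / STEPS
grid = [(k + 0.5) * dy for k in range(STEPS)]

h_x = entropy(P_X.values())
h_x_given_y = sum(entropy([posterior(x, y) for x in P_X]) * density_y(y) * dy
                  for y in grid)
# I(X;Y) as the integral of ln(p_{X|Y}/p_X) with respect to the joint distribution.
i_xy = sum(P_X[x] * density_y_given_x(y, x) * dy * math.log(posterior(x, y) / P_X[x])
           for y in grid for x in P_X if density_y_given_x(y, x) > 0)

print(i_xy, h_x - h_x_given_y)
```

In this model Y reveals X exactly except on the overlap interval [1, 1.5), which has probability 1/3, so both sides equal (2/3) ln 2.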

We now wish to study the effects of quantizing on conditional information.
As discussed in Section 2.5, it is not true that $I(X;Y|Z)$ is always at least as large as
$I(f(X); q(Y)|r(Z))$, so $I(X;Y|Z)$ cannot in general be written as a supremum
over all quantizers. Hence the definition of (5.34) and the formula (5.36)
do not have the intuitive counterpart of a limit of informations of quantized
values. We now consider an alternative (and more general) definition of
conditional mutual information due to Wyner [152]. The definition has the form of a
supremum over quantizers and does not require the existence of the probability
measure $P_{X \times Y|Z}$ and hence makes sense for alphabets that are not standard.
Given $P_{XYZ}$ and any finite measurements $f$ and $g$ on $X$ and $Y$, we can choose
a version of the conditional probability given $Z = z$ so that

$$p_z(a, b) = \Pr(f(X) = a, g(Y) = b | Z = z)$$

is a valid pmf with probability 1 (since the alphabets of $f$ and $g$ are finite and
hence standard, a regular conditional probability exists from Corollary 5.8.1 of
[50]). For such finite measurements we can define

$$I(f(X); g(Y)|Z = z) = \sum_{a \in A_f} \sum_{b \in A_g} p_z(a, b) \ln \frac{p_z(a, b)}{\sum_{a'} p_z(a', b) \sum_{b'} p_z(a, b')},$$

that is, the ordinary discrete average mutual information with respect to the
distribution $p_z$.

Lemma 5.5.7: Define

$$I'(X;Y|Z) = \sup_{f,g} \int dP_Z(z) \, I(f(X); g(Y)|Z = z),$$

where the supremum is over all quantizers. Then there exist sequences of
quantizers $q_m$, $r_m$ (as in Lemma 5.5.5) such that

$$I'(X;Y|Z) = \lim_{m \to \infty} I'(q_m(X); r_m(Y)|Z).$$

$I'$ satisfies Kolmogorov’s formula, that is,

$$I'(X;Y|Z) = I((X,Z);Y) - I(Y;Z).$$

If the alphabets are standard, then

$$I(X;Y|Z) = I'(X;Y|Z).$$

Comment: The main point here is that conditional mutual information can
be expressed as a supremum or limit over quantizers. The other results simply
point out that the two conditional mutual informations have the same relation
to ordinary mutual information and are (therefore) equal when both are defined.
The proof follows Wyner [152].

Proof: First observe that for any quantizers $q$ and $r$ of $A_f$ and $A_g$ we have
from the usual properties of mutual information that

$$I(q(f(X)); r(g(Y))|Z = z) \le I(f(X); g(Y)|Z = z)$$

and hence integrating we have that

$$I'(q(f(X)); r(g(Y))|Z) = \int I(q(f(X)); r(g(Y))|Z = z) \, dP_Z(z) \le \int I(f(X); g(Y)|Z = z) \, dP_Z(z) \qquad (5.46)$$

and hence taking the supremum over all $q$ and $r$ to get $I'(f(X); g(Y)|Z)$ yields

$$I'(f(X); g(Y)|Z) = \int I(f(X); g(Y)|Z = z) \, dP_Z(z), \qquad (5.47)$$

so that (5.46) becomes

$$I'(q(f(X)); r(g(Y))|Z) \le I'(f(X); g(Y)|Z) \qquad (5.48)$$

for any quantizers $q$ and $r$ and the definition of $I'$ can be expressed as

$$I'(X;Y|Z) = \sup_{f,g} I'(f(X); g(Y)|Z), \qquad (5.49)$$

where the supremum is over all quantizers $f$ and $g$. This proves the first part of
the lemma since the supremum can be approached by a sequence of quantizers.
Next observe that

$$I'(f(X); g(Y)|Z) = \int I(f(X); g(Y)|Z = z) \, dP_Z(z) = H(g(Y)|Z) - H(g(Y)|f(X), Z).$$

Since we have from Lemma 5.5.6 that

$$I(g(Y); Z) = H(g(Y)) - H(g(Y)|Z),$$

we have by adding these equations and again using Lemma 5.5.6 that

$$I(g(Y); Z) + I'(f(X); g(Y)|Z) = H(g(Y)) - H(g(Y)|f(X), Z) = I((f(X), Z); g(Y)).$$

Taking suprema of both sides over all quantizers $f$ and $g$ yields the relation

$$I(Y;Z) + I'(X;Y|Z) = I((X,Z);Y),$$

proving Kolmogorov’s formula. Lastly, if the spaces are standard, then
Kolmogorov’s formula for the original definition (which is valid for standard
alphabets) combined with the above formula implies that

$$I'(X;Y|Z) = I((X,Z);Y) - I(Y;Z) = I(X;Y|Z). \quad □$$
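
For finite alphabets the quantizers in Lemma 5.5.7 can be taken to be identity maps, so Kolmogorov's formula I'(X;Y|Z) = I((X,Z);Y) - I(Y;Z) can be verified directly. The sketch below does this for a hypothetical joint pmf on three bits; the weights are arbitrary and serve only as an illustration.

```python
import math
from itertools import product

def mutual_information(joint):
    """I(A;B) in nats for a joint pmf stored as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Hypothetical joint pmf p(x, y, z) on {0,1}^3; the weights are arbitrary.
weights = [4, 1, 2, 3, 1, 5, 2, 2]
norm = sum(weights)
pxyz = {xyz: w / norm for xyz, w in zip(product((0, 1), repeat=3), weights)}

# Left side: average over z of the discrete mutual information under p_z(x, y).
p_z = {z: sum(p for (_, _, zz), p in pxyz.items() if zz == z) for z in (0, 1)}
conditional = sum(
    p_z[z] * mutual_information({(x, y): p / p_z[z]
                                 for (x, y, zz), p in pxyz.items() if zz == z})
    for z in (0, 1))

# Right side: I((X,Z);Y) - I(Y;Z), each an ordinary mutual information.
i_xz_y = mutual_information({((x, z), y): p for (x, y, z), p in pxyz.items()})
p_yz = {}
for (x, y, z), p in pxyz.items():
    p_yz[(y, z)] = p_yz.get((y, z), 0.0) + p
i_y_z = mutual_information(p_yz)

print(conditional, i_xz_y - i_y_z)
```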



5.6 Some Convergence Results 

We now combine the convergence results for divergence with the definitions 
and properties of information densities to obtain some convergence results for 
information densities. Unlike the results to come for relative entropy rate and 
information rate, these are results involving the information between a sequence 
of random variables and a fixed random variable. 

Lemma 5.6.1: Given random variables $X$ and $Y_1, Y_2, \cdots$ defined on a
common probability space,

$$\lim_{n \to \infty} I(X; (Y_1, Y_2, \cdots, Y_n)) = I(X; (Y_1, Y_2, \cdots)).$$

If in addition $I(X; (Y_1, Y_2, \cdots)) < \infty$ and hence $P_X \times P_{Y_1,Y_2,\cdots} \gg P_{X,Y_1,Y_2,\cdots}$,
then

$$i_{X;Y_1,Y_2,\cdots,Y_n} \to i_{X;Y_1,Y_2,\cdots} \quad \text{as } n \to \infty$$

in $L^1$.

Proof: The first result follows from Corollary 5.2.5 with $X, Y_1, Y_2, \cdots, Y_{n-1}$
replacing $X^n$, $P$ being the distribution $P_{X,Y_1,\cdots}$, and $M$ being the product
distribution $P_X \times P_{Y_1,Y_2,\cdots}$. The density result follows from Lemma 5.4.1. □
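
The first statement of Lemma 5.6.1 can be seen in a simple assumed model: X is a fair bit and Y_1, Y_2, ... are repeated noisy looks at X, each flipped independently with probability EPS. Here I(X;(Y_1, Y_2, ...)) = H(X) = ln 2, since X can be recovered from the infinite sequence with probability 1. The sketch below computes I(X;(Y_1,...,Y_n)) exactly by grouping the sequences y^n according to their number of ones; the model and the value of EPS are illustrative assumptions.

```python
import math

EPS = 0.2  # assumed flip probability of each observation

def information_with_n_observations(n):
    """I(X;(Y_1,...,Y_n)) in nats for a fair bit X observed n times through BSC(EPS)."""
    total = 0.0
    for k in range(n + 1):                      # k = number of observations equal to 1
        count = math.comb(n, k)                 # sequences y^n with exactly k ones
        p_y_given_0 = EPS ** k * (1 - EPS) ** (n - k)
        p_y_given_1 = EPS ** (n - k) * (1 - EPS) ** k
        p_y = 0.5 * (p_y_given_0 + p_y_given_1)
        for p_y_given_x in (p_y_given_0, p_y_given_1):
            total += count * 0.5 * p_y_given_x * math.log(p_y_given_x / p_y)
    return total

values = [information_with_n_observations(n) for n in range(1, 41)]
for n in (1, 5, 10, 20, 40):
    print(n, values[n - 1])
print("ln 2 =", math.log(2))
```

The computed informations increase with n and approach ln 2, so in this example the normalized quantities (1/n) I(X;(Y_1,...,Y_n)) tend to zero.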

Corollary 5.6.1: Given random variables $X$, $Y$, and $Z_1, Z_2, \cdots$ defined on
a common probability space, then

$$\lim_{n \to \infty} I(X; Y|Z_1, Z_2, \cdots, Z_n) = I(X; Y|Z_1, Z_2, \cdots).$$

If

$$I((X, Z_1, \cdots); Y) < \infty$$

(e.g., if $Y$ has a finite alphabet and hence $I((X, Z_1, \cdots); Y) \le H(Y) < \infty$),
then also

$$i_{X;Y|Z_1,\cdots,Z_n} \to i_{X;Y|Z_1,\cdots} \quad \text{as } n \to \infty \qquad (5.50)$$

in $L^1$.

Proof: From Kolmogorov’s formula

$$I(X; Y|Z_1, Z_2, \cdots, Z_n) = I(X; (Y, Z_1, Z_2, \cdots, Z_n)) - I(X; Z_1, \cdots, Z_n) \ge 0. \qquad (5.51)$$

From the previous lemma, the first term on the right converges as $n \to \infty$ to
$I(X; (Y, Z_1, \cdots))$ and the second term on the right is the negative of a term
converging to $I(X; (Z_1, \cdots))$. If the first of these limits is finite, then the difference
in (5.51) converges to the difference of these terms, which gives (5.50). From
the chain rule for information densities, the conditional information density is
the difference of the information densities:

$$i_{X;Y|Z_1,\cdots,Z_n} = i_{X;(Y,Z_1,\cdots,Z_n)} - i_{X;(Z_1,\cdots,Z_n)},$$

which is converging in $L^1$ to

$$i_{X;Y|Z_1,\cdots} = i_{X;(Y,Z_1,\cdots)} - i_{X;(Z_1,\cdots)},$$

again invoking the density chain rule. If $I(X; Y|Z_1, \cdots) = \infty$ then quantize $Y$
as $q(Y)$ and note since $q(Y)$ has a finite alphabet that

$$I(X; Y|Z_1, Z_2, \cdots, Z_n) \ge I(X; q(Y)|Z_1, Z_2, \cdots, Z_n) \to I(X; q(Y)|Z_1, \cdots) \quad \text{as } n \to \infty$$

and hence

$$\liminf_{n \to \infty} I(X; Y|Z_1, \cdots, Z_n) \ge I(X; q(Y)|Z_1, \cdots).$$

Since the right-hand term above can be made arbitrarily large, the remaining
part of the corollary is proved. □

Lemma 5.6.2: If

$$P_X \times P_{Y_1,Y_2,\cdots} \gg P_{X,Y_1,Y_2,\cdots}$$

(e.g., $I(X; (Y_1, Y_2, \cdots)) < \infty$), then with probability 1

$$\lim_{n \to \infty} \frac{1}{n} i(X; (Y_1, \cdots, Y_n)) = 0.$$

Proof: This is a corollary of Theorem 5.4.1. Let $P$ denote the distribution of
$\{X, Y_1, Y_2, \cdots\}$ and let $M$ denote the distribution $P_X \times P_{Y_1,\cdots}$. By assumption
$M \gg P$. The information density is

$$i(X; (Y_1, \cdots, Y_n)) = \ln \frac{dP_n}{dM_n},$$

where $P_n$ and $M_n$ are the restrictions of $P$ and $M$ to $\sigma(X, Y_1, \cdots, Y_n)$. Theorem
5.4.1 can therefore be applied to conclude that $P$-a.e.

$$\lim_{n \to \infty} \frac{1}{n} \ln \frac{dP_n}{dM_n} = 0,$$

which proves the lemma. □

The lemma has the following immediate corollary.

Corollary 5.6.2: If $\{X_n\}$ is a process with the property that

$$I(X_0; X_{-1}, X_{-2}, \cdots) < \infty,$$

that is, there is a finite amount of information between the zero time sample
and the infinite past, then

$$\lim_{n \to \infty} \frac{1}{n} i(X_0; X_{-1}, \cdots, X_{-n}) = 0.$$

If the process is stationary, then also

$$\lim_{n \to \infty} \frac{1}{n} i(X_n; X^n) = 0.$$




Chapter 6 



Information Rates II 



6.1 Introduction 

In this chapter we develop general definitions of information rate for processes 
with standard alphabets and we prove a mean ergodic theorem for information 
densities. The $L^1$ results are extensions of the results of Moy [105] and Perez
[123] for stationary processes, which in turn extended the Shannon-McMillan 
theorem from entropies of discrete alphabet processes to information densities. 
(See also Kieffer [85].) We also relate several different measures of information 
rate and consider the mutual information between a stationary process and its 
ergodic component function. In the next chapter we apply the results of Chapter 
5 on divergence to the definitions of this chapter for limiting information and 
entropy rates to obtain a number of results describing the behavior of such 
rates. In Chapter 8 almost everywhere ergodic theorems for relative entropy 
and information densities are proved. 



6.2 Information Rates for General Alphabets 

Suppose that we are given a pair random process $\{X_n, Y_n\}$ with distribution $p$.
The most natural definition of the information rate between the two processes
is the extension of the definition for the finite alphabet case:

$$\bar{I}(X;Y) = \limsup_{n \to \infty} \frac{1}{n} I(X^n; Y^n).$$

This was the first general definition of information rate and it is due to Do- 
brushin [32]. While this definition has its uses, it also has its problems. Another 
definition is more in the spirit of the definition of information itself: We formed 
the general definitions by taking a supremum of the finite alphabet definitions 
over all finite alphabet codings or quantizers. The above definition takes the 
limit of such suprema. An alternative definition is to instead reverse the order
and take the supremum of the limit and hence the supremum of the information
rate over all finite alphabet codings of the process. This will provide a
definition of information rate similar to the definition of the entropy of a
dynamical system. There is a question as to what kind of codings we permit, that
is, do the quantizers quantize individual outputs or long sequences of outputs.
We shall shortly see that it makes no difference. Suppose that we have a pair
random process $\{X_n, Y_n\}$ with standard alphabets $A_X$ and $A_Y$ and suppose
that $f: A_X^\infty \to A_f$ and $g: A_Y^\infty \to A_g$ are stationary codings of the $X$ and $Y$
sequence spaces into a finite alphabet. We will call such finite alphabet
stationary mappings sliding block codes or stationary digital codes. Let $\{f_n, g_n\}$
be the induced output process, that is, if $T$ denotes the shift (on any of the
sequence spaces) then $f_n(x, y) = f(T^n x)$ and $g_n(x, y) = g(T^n y)$. Recall that
$f(T^n(x, y)) = f_n(x, y)$, that is, shifting the input $n$ times results in the output
being shifted $n$ times.

Since the new process $\{f_n, g_n\}$ has a finite alphabet, its mutual information
rate is defined. We now define the information rate for general alphabets as
follows:

$$I^*(X;Y) = \sup_{\text{sliding block codes } f,g} \bar{I}(f; g) = \sup_{\text{sliding block codes } f,g} \limsup_{n \to \infty} \frac{1}{n} I(f^n; g^n).$$

We now focus on AMS processes, in which case the information rates for
finite alphabet processes (e.g., quantized processes) are given by the limit, that
is,

$$I^*(X;Y) = \sup_{\text{sliding block codes } f,g} \bar{I}(f; g) = \sup_{\text{sliding block codes } f,g} \lim_{n \to \infty} \frac{1}{n} I(f^n; g^n).$$

The following lemma shows that for AMS sources $I^*$ can also be evaluated by
constraining the sliding block codes to be scalar quantizers.

Lemma 6.2.1: Given an AMS pair random process $\{X_n, Y_n\}$ with standard
alphabet,

$$I^*(X;Y) = \sup_{q,r} \bar{I}(q(X); r(Y)) = \sup_{q,r} \limsup_{n \to \infty} \frac{1}{n} I(q(X)^n; r(Y)^n),$$

where the supremum is over all quantizers $q$ of $A_X$ and $r$ of $A_Y$ and where
$q(X)^n = q(X_0), \cdots, q(X_{n-1})$.

Proof: Clearly the right hand side above is no greater than $I^*$ since a scalar
quantizer is a special case of a stationary code. Conversely, suppose that $f$ and $g$
are sliding block codes such that $\bar{I}(f; g) \ge I^*(X;Y) - \epsilon$. Then from Corollary
4.3.1 there are quantizers $q$ and $r$ and codes $f'$ and $g'$ depending only on the
quantized processes $q(X_n)$ and $r(Y_n)$ such that $\bar{I}(f'; g') \ge \bar{I}(f; g) - \epsilon$. From
Lemma 4.3.3, however, $\bar{I}(q(X); r(Y)) \ge \bar{I}(f'; g')$ since $f'$ and $g'$ are stationary
codings of the quantized processes. Thus $\bar{I}(q(X); r(Y)) \ge I^*(X;Y) - 2\epsilon$, which
proves the lemma. □

Corollary 6.2.1:

$$I^*(X;Y) \le \bar{I}(X;Y).$$

If the alphabets are finite, then the two rates are equal.

Proof: The inequality follows from the lemma and the fact that

$$I(X^n; Y^n) \ge I(q(X)^n; r(Y)^n)$$

for any scalar quantizers $q$ and $r$ (where $q(X)^n$ is $q(X_0), \cdots, q(X_{n-1})$). If
the alphabets are finite, then the identity mappings are quantizers and yield
$I(X^n; Y^n)$ for all $n$. □
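
For finite alphabets the two rates coincide and can be approached by computing the normalized block informations (1/n) I(X^n;Y^n) directly. The sketch below does this by brute-force enumeration for an assumed example that is not taken from the text: X is a stationary symmetric binary Markov chain and Y is X observed through memoryless bit flips. The parameters STAY and FLIP are illustrative assumptions.

```python
import math
from itertools import product

STAY = 0.9   # assumed P(X_{k+1} = X_k) for a symmetric binary Markov chain
FLIP = 0.1   # assumed probability that Y_k differs from X_k (memoryless noise)

def prob_x(xs):
    p = 0.5                                   # uniform start is stationary here
    for prev, cur in zip(xs, xs[1:]):
        p *= STAY if prev == cur else 1 - STAY
    return p

def prob_y_given_x(ys, xs):
    p = 1.0
    for x, y in zip(xs, ys):
        p *= FLIP if x != y else 1 - FLIP
    return p

def block_information(n):
    """I(X^n;Y^n) in nats by enumerating all pairs of binary n-tuples."""
    seqs = list(product((0, 1), repeat=n))
    joint = {(xs, ys): prob_x(xs) * prob_y_given_x(ys, xs)
             for xs in seqs for ys in seqs}
    p_y = {ys: sum(joint[(xs, ys)] for xs in seqs) for ys in seqs}
    return sum(p * math.log(p / (prob_x(xs) * p_y[ys]))
               for (xs, ys), p in joint.items())

per_symbol = [block_information(n) / n for n in range(1, 7)]
for n, value in enumerate(per_symbol, start=1):
    print(n, value)
```

In this example I(X^n;Y^n) = H(Y^n) - n h(FLIP), where h is the binary entropy function, and H(Y^n)/n is nonincreasing for a stationary process, so the printed values decrease toward the information rate. Enumeration costs 4^n operations and is only practical for small n.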

Pinsker [125] introduced the definition of information rate as a supremum 
over all scalar quantizers and hence we shall refer to this information rate as 
the Pinsker rate. The Pinsker definition has the advantage that we can use the 
known properties of information rates for finite alphabet processes to infer those 
for general processes, an attribute the first definition lacks. 

Corollary 6.2.2: Given a standard alphabet pair process alphabet $A_X \times A_Y$
there is a sequence of scalar quantizers $q_m$ and $r_m$ such that for any AMS pair
process $\{X_n, Y_n\}$ having this alphabet (that is, for any process distribution on
the corresponding sequence space)

$$I(X^n; Y^n) = \lim_{m \to \infty} I(q_m(X)^n; r_m(Y)^n)$$

$$I^*(X;Y) = \lim_{m \to \infty} \bar{I}(q_m(X); r_m(Y)).$$

Furthermore, the above limits can be taken to be increasing by using finer and
finer quantizers.

Comment: It is important to note that the same sequence of
quantizers gives both of the limiting results.

Proof: The first result is Lemma 5.5.5. The second follows from the previous
lemma. □

Observe that

$$I^*(X;Y) = \lim_{m \to \infty} \limsup_{n \to \infty} \frac{1}{n} I(q_m(X)^n; r_m(Y)^n)$$

whereas

$$\bar{I}(X;Y) = \limsup_{n \to \infty} \lim_{m \to \infty} \frac{1}{n} I(q_m(X)^n; r_m(Y)^n).$$

Thus the two notions of information rate are equal if the two limits can be
interchanged. We shall later consider conditions under which this is true and
we shall see that equality of these two rates is important for proving ergodic
theorems for information densities.

Lemma 6.2.2: Suppose that $\{X_n, Y_n\}$ is an AMS standard alphabet
random process with distribution $p$ and stationary mean $\bar{p}$. Then

$$I^*_p(X;Y) = I^*_{\bar{p}}(X;Y).$$

$I^*_p$ is an affine function of the distribution $p$. If $\bar{p}$ has ergodic decomposition $\bar{p}_{xy}$,
then

$$I^*_p(X;Y) = \int d\bar{p}(x, y) \, I^*_{\bar{p}_{xy}}(X;Y).$$

If $f$ and $g$ are stationary codings of $X$ and $Y$, then

$$\bar{I}_p(f; g) = \int d\bar{p}(x, y) \, \bar{I}_{\bar{p}_{xy}}(f; g).$$

Proof: For any scalar quantizers $q$ and $r$ of $X$ and $Y$ we have that
$\bar{I}_p(q(X); r(Y)) = \bar{I}_{\bar{p}}(q(X); r(Y))$. Taking a limit with ever finer quantizers yields the first
equality. The fact that $I^*$ is affine follows similarly. Suppose that $\bar{p}$ has ergodic
decomposition $\bar{p}_{xy}$. Define the induced distributions of the quantized process
by $m$ and $m_{xy}$, that is, $m(F) = \bar{p}(x, y : \{q(x_i), r(y_i); i \in \mathcal{T}\} \in F)$ and similarly
for $m_{xy}$. It is easy to show that $m$ is stationary (since it is a stationary coding
of a stationary process), that the $m_{xy}$ are stationary ergodic (since they are
stationary codings of stationary ergodic processes), and that the $m_{xy}$ form an
ergodic decomposition of $m$. If we let $X'_n, Y'_n$ denote the coordinate functions
on the quantized output sequence space (that is, the processes $\{q(X_n), r(Y_n)\}$
and $\{X'_n, Y'_n\}$ are equivalent), then using the ergodic decomposition of mutual
information for finite alphabet processes (Lemma 4.3.1) we have that

$$\bar{I}_{\bar{p}}(q(X); r(Y)) = \bar{I}_m(X'; Y') = \int \bar{I}_{m_{x'y'}}(X'; Y') \, dm(x', y') = \int \bar{I}_{\bar{p}_{xy}}(q(X); r(Y)) \, d\bar{p}(x, y).$$

Replacing the quantizers by the sequence $q_m$, $r_m$ the result then follows by
taking the limit using the monotone convergence theorem. The result for
stationary codings follows similarly by applying the previous result to the induced
distributions and then relating the equation to the original distributions. □

The above properties are not known to hold for $\bar{I}$ in the general case. Thus
although $\bar{I}$ may appear to be a more natural definition of mutual information
rate, $I^*$ is better behaved since it inherits properties from the discrete alphabet
case. It will be of interest to find conditions under which the two rates are the
same, since then $\bar{I}$ will share the properties possessed by $I^*$. The first result of
the next section adds to the interest by demonstrating that when the two rates
are equal, a mean ergodic theorem holds for the information densities.



6.3 A Mean Ergodic Theorem for Densities 

Theorem 6.3.1: Given an AMS pair process $\{X_n, Y_n\}$ with standard
alphabets, assume that for all $n$

$$P_{X^n} \times P_{Y^n} \gg P_{X^n Y^n}$$

and hence that the information densities

$$i_{X^n;Y^n} = \ln \frac{dP_{X^n,Y^n}}{d(P_{X^n} \times P_{Y^n})}$$

are well defined. For simplicity we abbreviate $i_{X^n;Y^n}$ to $i_n$ when there is no
possibility of confusion. If the limit

$$\lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n) = \bar{I}(X;Y)$$

exists and

$$\bar{I}(X;Y) = I^*(X;Y) < \infty,$$

then $n^{-1} i_n(X^n; Y^n)$ converges in $L^1$ to an invariant function $i(X;Y)$. If the
stationary mean of the process has an ergodic decomposition $\bar{p}_{xy}$, then the
limiting density is $I^*_{\bar{p}_{xy}}(X;Y)$, the information rate of the ergodic component
in effect.

Proof: Let $q_m$ and $r_m$ be asymptotically accurate quantizers for $A_X$ and $A_Y$.
Define the discrete approximations $\hat{X}_n = q_m(X_n)$ and $\hat{Y}_n = r_m(Y_n)$. Observe
that $P_{X^n} \times P_{Y^n} \gg P_{X^n Y^n}$ implies that $P_{\hat{X}^n} \times P_{\hat{Y}^n} \gg P_{\hat{X}^n \hat{Y}^n}$ and hence we
can define the information densities of the quantized vectors by

$$\hat{i}_n = \ln \frac{dP_{\hat{X}^n \hat{Y}^n}}{d(P_{\hat{X}^n} \times P_{\hat{Y}^n})}.$$

For any $m$ we have that

$$\int \Big| \frac{1}{n} i_n(x^n; y^n) - I^*_{\bar{p}_{xy}}(X;Y) \Big| \, dp(x, y) \le \int \Big| \frac{1}{n} i_n(x^n; y^n) - \frac{1}{n} \hat{i}_n(q_m(x)^n; r_m(y)^n) \Big| \, dp(x, y)$$

$$+ \int \Big| \frac{1}{n} \hat{i}_n(q_m(x)^n; r_m(y)^n) - \bar{I}_{\bar{p}_{xy}}(q_m(X); r_m(Y)) \Big| \, dp(x, y)$$

$$+ \int \Big| \bar{I}_{\bar{p}_{xy}}(q_m(X); r_m(Y)) - I^*_{\bar{p}_{xy}}(X;Y) \Big| \, dp(x, y), \qquad (6.1)$$

where

$$q_m(x)^n = (q_m(x_0), \cdots, q_m(x_{n-1})),$$

$$r_m(y)^n = (r_m(y_0), \cdots, r_m(y_{n-1})),$$

and $\bar{I}_p(q_m(X); r_m(Y))$ denotes the information rate of the process $\{q_m(X_n), r_m(Y_n);
n = 0, 1, \cdots\}$ when $p$ is the process measure describing $\{X_n, Y_n\}$.

Consider first the right-most term of (6.1). Since $I^*$ is the supremum over
all quantized versions,

$$\int \Big| \bar{I}_{\bar{p}_{xy}}(q_m(X); r_m(Y)) - I^*_{\bar{p}_{xy}}(X;Y) \Big| \, dp(x, y) = \int \Big( I^*_{\bar{p}_{xy}}(X;Y) - \bar{I}_{\bar{p}_{xy}}(q_m(X); r_m(Y)) \Big) \, dp(x, y).$$

Using the ergodic decomposition of $I^*$ (Lemma 6.2.2) and that of $\bar{I}$ for discrete
alphabet processes (Lemma 4.3.1) this becomes

$$\int \Big| \bar{I}_{\bar{p}_{xy}}(q_m(X); r_m(Y)) - I^*_{\bar{p}_{xy}}(X;Y) \Big| \, dp(x, y) = I^*_p(X;Y) - \bar{I}_p(q_m(X); r_m(Y)). \qquad (6.2)$$

For fixed $m$ the middle term of (6.1) can be made arbitrarily small by taking
$n$ large enough from the finite alphabet result of Lemma 4.3.1. The first term on
the right can be bounded above using Corollary 5.2.6 with $\mathcal{F} = \sigma(q_m(X)^n; r_m(Y)^n)$
by

$$\frac{1}{n} \Big( I(X^n; Y^n) - I(\hat{X}^n; \hat{Y}^n) + \frac{2}{e} \Big),$$

which as $n \to \infty$ goes to $\bar{I}(X;Y) - \bar{I}(q_m(X); r_m(Y))$. Thus we have for any $m$
that

$$\limsup_{n \to \infty} \int \Big| \frac{1}{n} i_n(x^n; y^n) - I^*_{\bar{p}_{xy}}(X;Y) \Big| \, dp(x, y) \le \bar{I}(X;Y) - \bar{I}(q_m(X); r_m(Y)) + I^*(X;Y) - \bar{I}(q_m(X); r_m(Y)),$$

which as $m \to \infty$ becomes $\bar{I}(X;Y) - I^*(X;Y)$, which is 0 by assumption. □



6.4 Information Rates of Stationary Processes 

In this section we introduce two more definitions of information rates for the 
case of stationary two-sided processes. These rates are useful tools in relating 
the Dobrushin and Pinsker rates and they provide additional interpretations 
of mutual information rates in terms of ordinary mutual information. The 
definitions follow Pinsker [125]. 

Henceforth assume that $\{X_n, Y_n\}$ is a stationary two-sided pair process with
standard alphabets. Define the sequences $y = \{y_i; i \in \mathcal{T}\}$ and $Y = \{Y_i; i \in \mathcal{T}\}$.
First define

$$\tilde{I}(X;Y) = \limsup_{n \to \infty} \frac{1}{n} I(X^n; Y),$$

that is, consider the per-letter limiting information between $n$-tuples of $X$ and
the entire sequence from $Y$. Next define

$$I^-(X;Y) = I(X_0; Y|X_{-1}, X_{-2}, \cdots),$$

that is, the average conditional mutual information between one letter from $X$
and the entire $Y$ sequence given the infinite past of the $X$ process. We could
define the first rate for one-sided processes, but the second makes sense only
when we can consider an infinite past. For brevity we write $X^- = X_{-1}, X_{-2}, \cdots$
and hence

$$I^-(X;Y) = I(X_0; Y|X^-).$$

Theorem 6.4.1:

$$\tilde{I}(X;Y) \ge \bar{I}(X;Y) \ge I^*(X;Y) \ge I^-(X;Y).$$

If the alphabet of $X$ is finite, then the above rates are all equal.

Comment: We will later see more general sufficient conditions for the equal- 
ity of the various rates, but the case where one alphabet is finite is simple and 
important and points out that the rates are all equal in the finite alphabet case. 

Proof: We have already proved the middle inequality. The left inequality
follows immediately from the fact that $I(X^n; Y) \ge I(X^n; Y^n)$ for all $n$. The
remaining inequality is more involved. We prove it in two steps. First we prove
the second half of the theorem, that the rates are the same if $X$ has finite
alphabet. We then couple this with an approximation argument to prove the
remaining inequality. Suppose now that the alphabet of $X$ is finite. Using the
chain rule and stationarity we have that

$$\frac{1}{n} I(X^n; Y^n) = \frac{1}{n} \sum_{i=0}^{n-1} I(X_i; Y^n|X_0, \cdots, X_{i-1}) = \frac{1}{n} \sum_{i=0}^{n-1} I(X_0; Y^n_{-i}|X_{-1}, \cdots, X_{-i}),$$

where $Y^n_{-i}$ is $Y_{-i}, \cdots, Y_{-i+n-1}$, that is, the $n$-vector starting at $-i$. Since $X$ has
finite alphabet, each term in the sum is bounded. We can show as in Section 5.5
(or using Kolmogorov’s formula and Lemma 5.5.1) that each term converges as
$i \to \infty$, $n \to \infty$, and $n - i \to \infty$ to $I(X_0; Y|X_{-1}, X_{-2}, \cdots)$ or $I^-(X;Y)$. These
facts, however, imply that the above Cesàro average converges to the same limit
and hence $\bar{I} = I^-$. We can similarly expand $\tilde{I}$ as

$$\frac{1}{n} \sum_{i=0}^{n-1} I(X_i; Y|X_0, \cdots, X_{i-1}) = \frac{1}{n} \sum_{i=0}^{n-1} I(X_0; Y|X_{-1}, \cdots, X_{-i}),$$

which converges to the same limit for the same reasons. Thus $\tilde{I} = \bar{I} = I^-$ for
stationary processes when the alphabet of $X$ is finite. Now suppose that $X$
has a standard alphabet and let $q_m$ be an asymptotically accurate sequence of
quantizers. Recall that the corresponding partitions are increasing, that is, each
refines the previous partition. Fix $\epsilon > 0$ and choose $m$ large enough so that the
quantizer $\alpha(X_0) = q_m(X_0)$ satisfies

$$I(\alpha(X_0); Y|X^-) \ge I(X_0; Y|X^-) - \epsilon.$$

Observe that so far we have only quantized $X_0$ and not the past. Since

$$\mathcal{F}_m = \sigma(\alpha(X_0), Y, q_m(X_{-i}); i = 1, 2, \cdots)$$

asymptotically generates

$$\sigma(\alpha(X_0), Y, X_{-i}; i = 1, 2, \cdots),$$

given $\epsilon$ we can choose for $m$ large enough (larger than before) a quantizer $\beta(x) =
q_m(x)$ such that if we define $\beta(X^-)$ to be $\beta(X_{-1}), \beta(X_{-2}), \cdots$, then

$$|I(\alpha(X_0); (Y, \beta(X^-))) - I(\alpha(X_0); (Y, X^-))| \le \epsilon$$

and

$$|I(\alpha(X_0); \beta(X^-)) - I(\alpha(X_0); X^-)| \le \epsilon.$$

Using Kolmogorov’s formula this implies that

$$|I(\alpha(X_0); Y|X^-) - I(\alpha(X_0); Y|\beta(X^-))| \le 2\epsilon$$

and hence that

$$I(\alpha(X_0); Y|\beta(X^-)) \ge I(\alpha(X_0); Y|X^-) - 2\epsilon \ge I(X_0; Y|X^-) - 3\epsilon.$$

But the partition corresponding to $\beta$ refines that of $\alpha$ and hence increases the
information; hence

$$I(\beta(X_0); Y|\beta(X^-)) \ge I(\alpha(X_0); Y|\beta(X^-)) \ge I(X_0; Y|X^-) - 3\epsilon.$$

Since $\beta(X_n)$ has a finite alphabet, however, from the finite alphabet result the
left-most term above must be $\bar{I}(\beta(X); Y)$, which can be made arbitrarily close
to $I^*(X;Y)$. Since $\epsilon$ is arbitrary, this proves the final inequality. □

The following two theorems provide sufficient conditions for equality of the 
various information rates. The first result is almost a special case of the second, 
but it is handled separately as it is simpler, much of the proof applies to the 
second case, and it is not an exact special case of the subsequent result since it 
does not require the second condition of that result. The result corresponds to 
condition (7.4.33) of Pinsker [125], who also provides more general conditions. 
The more general condition is also due to Pinsker and strongly resembles that 
considered by Barron [9]. 

Theorem 6.4.2: Given a stationary pair process $\{X_n, Y_n\}$ with standard
alphabets, if

$$I(X_0; (X_{-1}, X_{-2}, \cdots)) < \infty,$$

then

$$\tilde{I}(X;Y) = \bar{I}(X;Y) = I^*(X;Y) = I^-(X;Y). \qquad (6.3)$$

Proof: We have that

$$\frac{1}{n} I(X^n; Y) \le \frac{1}{n} I(X^n; (Y, X^-)) = \frac{1}{n} I(X^n; X^-) + \frac{1}{n} I(X^n; Y|X^-), \qquad (6.4)$$

where, as before, $X^- = \{X_{-1}, X_{-2}, \cdots\}$. Consider the first term on the right.
Using the chain rule for mutual information

$$\frac{1}{n} I(X^n; X^-) = \frac{1}{n} \sum_{i=0}^{n-1} I(X_i; X^-|X^i) = \frac{1}{n} \sum_{i=0}^{n-1} \big( I(X_i; (X^i, X^-)) - I(X_i; X^i) \big). \qquad (6.5)$$

Using stationarity we have that

$$\frac{1}{n} I(X^n; X^-) = \frac{1}{n} \sum_{i=0}^{n-1} \big( I(X_0; X^-) - I(X_0; (X_{-1}, \cdots, X_{-i})) \big). \qquad (6.6)$$

The terms $I(X_0; (X_{-1}, \cdots, X_{-i}))$ are converging to $I(X_0; X^-)$, hence the terms
in the sum are converging to 0, i.e.,

$$\lim_{i \to \infty} I(X_i; X^-|X^i) = 0. \qquad (6.7)$$

The Cesàro mean of (6.5) is converging to the same thing and hence

$$\frac{1}{n} I(X^n; X^-) \to 0. \qquad (6.8)$$

Next consider the term $I(X^n; Y|X^-)$. For any positive integers $n, m$ we have

$$I(X^{n+m}; Y|X^-) = I(X^n; Y|X^-) + I(X_n^m; Y|X^-, X^n), \qquad (6.9)$$

where $X_n^m = X_n, \cdots, X_{n+m-1}$. From stationarity, however, the rightmost term
is just $I(X^m; Y|X^-)$ and hence

$$I(X^{m+n}; Y|X^-) = I(X^n; Y|X^-) + I(X^m; Y|X^-). \qquad (6.10)$$

This is just a linear functional equation of the form $f(n + m) = f(n) + f(m)$
and the unique solution to such an equation is $f(n) = n f(1)$, that is,

$$\frac{1}{n} I(X^n; Y|X^-) = I(X_0; Y|X^-) = I^-(X;Y). \qquad (6.11)$$

Taking the limit supremum in (6.4) we have shown that

$$\tilde{I}(X;Y) \le I^-(X;Y), \qquad (6.12)$$

which with Theorem 6.4.1 completes the proof. □

Intuitively, the theorem states that if one of the processes has finite average
mutual information between one symbol and its infinite past, then the Dobrushin
and Pinsker information rates yield the same value and hence there is an $L^1$
ergodic theorem for the information density.

To generalize the theorem we introduce a condition that will often be useful 
when studying asymptotic properties of entropy and information. A stationary 
process {X n } is said to have the finite-gap information property if there exists 
an integer K such that 



$$I(X_K; X^-|X^K) < \infty, \qquad (6.13)$$







where, as usual, $X^- = (X_{-1}, X_{-2}, \cdots)$. When a process has this property for
a specific $K$, we shall say that it has the $K$-gap information property. Observe
that if a process possesses this property, then it follows from Lemma 5.5.4 that

$$I(X_K; (X_{-1}, \cdots, X_{-l})|X^K) < \infty; \quad l = 1, 2, \cdots \qquad (6.14)$$

Since these informations are finite,

$$P^{(K)}_{X^n} \gg P_{X^n}; \quad n = 1, 2, \cdots, \qquad (6.15)$$

where $P^{(K)}_{X^n}$ is the $K$th order Markov approximation to $P_{X^n}$.
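
A first-order Markov chain is perhaps the simplest process with the finite-gap information property: once X_0 is known, X_1 is conditionally independent of the past, so the 1-gap informations in (6.14) are all zero. The sketch below checks this for an assumed two-state chain (the transition matrix T is a hypothetical choice) using a past of length two, and compares it with the K = 0 quantity I(X_0; (X_{-1}, X_{-2})), which is positive.

```python
import math
from itertools import product

# Assumed two-state Markov chain; T[a][b] = P(next = b | current = a).
T = [[0.7, 0.3], [0.6, 0.4]]
PI = [2 / 3, 1 / 3]   # stationary distribution of T

# Joint pmf of (X_{-2}, X_{-1}, X_0, X_1), stored with coordinates 0, 1, 2, 3.
pmf = {s: PI[s[0]] * T[s[0]][s[1]] * T[s[1]][s[2]] * T[s[2]][s[3]]
       for s in product((0, 1), repeat=4)}

def marginal(keep):
    """Marginal pmf over the coordinates listed in `keep` (a tuple of indices)."""
    out = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def conditional_information(a, b, c):
    """I(A;B|C) in nats; a, b, c are tuples of coordinate indices (c may be empty)."""
    pc, pac, pbc = marginal(c), marginal(a + c), marginal(b + c)
    acc = 0.0
    for k, p in marginal(a + b + c).items():
        if p > 0:
            ka, kb, kc = k[:len(a)], k[len(a):len(a) + len(b)], k[len(a) + len(b):]
            acc += p * math.log(p * pc[kc] / (pac[ka + kc] * pbc[kb + kc]))
    return acc

gap_one = conditional_information((3,), (0, 1), (2,))   # I(X_1; (X_{-2}, X_{-1}) | X_0)
gap_zero = conditional_information((2,), (0, 1), ())    # I(X_0; (X_{-2}, X_{-1}))
one_step = conditional_information((2,), (1,), ())      # I(X_0; X_{-1})
print(gap_one, gap_zero, one_step)
```

For this chain the 1-gap information vanishes up to rounding, while the 0-gap information equals I(X_0; X_{-1}) because the more distant past adds nothing once X_{-1} is known.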

Theorem 6.4.3: Given a stationary standard alphabet pair process $\{X_n, Y_n\}$,
if $\{X_n\}$ satisfies the finite-gap information property (6.13) and if, in addition,

$$I(X^K; Y) < \infty, \qquad (6.16)$$

then (6.3) holds.

(If $K = 0$ then there is no conditioning and (6.16) is trivial, that is, the
previous theorem is the special case with $K = 0$.)

Comment: This theorem shows that if there is any finite dimensional future
vector $(X_K, X_{K+1}, \cdots, X_{K+N-1})$ which has finite mutual information with
respect to the infinite past $X^-$ when conditioned on the intervening gap $(X_0, \cdots, X_{K-1})$,
then the various definitions of mutual information rate are equivalent provided that
the mutual information between the “gap” $X^K$ and the sequence $Y$ is finite.

Note that this latter condition will hold if, for example, $\tilde{I}(X;Y)$ is finite.

Proof: For $n > K$

$$\frac{1}{n} I(X^n; Y) = \frac{1}{n} I(X^K; Y) + \frac{1}{n} I(X_K^{n-K}; Y|X^K).$$

By assumption the first term on the right will tend to 0 as $n \to \infty$ and hence we
focus on the second, which can be broken up analogous to the previous theorem
with the addition of the conditioning:

$$\frac{1}{n} I(X_K^{n-K}; Y|X^K) \le \frac{1}{n} I(X_K^{n-K}; (Y, X^-)|X^K) = \frac{1}{n} I(X_K^{n-K}; X^-|X^K) + \frac{1}{n} I(X_K^{n-K}; Y|X^-, X^K).$$

Consider first the term

$$\frac{1}{n} I(X_K^{n-K}; X^-|X^K) = \frac{1}{n} \sum_{i=K}^{n-1} I(X_i; X^-|X^i),$$

which is as (6.5) in the proof of Theorem 6.4.2 except that the first $K$ terms
are missing. The same argument then shows that the limit of the sum is 0. The
remaining term is, by stationarity,

$$\frac{1}{n} I(X_K^{n-K}; Y|X^-, X^K) = \frac{1}{n} I(X^{n-K}; Y|X^-)$$

exactly as in the proof of Theorem 6.4.2 and the same argument then shows
that the limit is $I^-(X;Y)$, which completes the proof. □

One result developed in the proofs of Theorems 6.4.2 and 6.4.3 will be im- 
portant later in its own right and hence we isolate it as a corollary. The result 
is just (6.7), which remains valid under the more general conditions of Theorem 
6.4.3, and the fact that the Cesaro mean of converging terms has the same limit. 
Corollary 6.4.1: If a process $\{X_n\}$ has the finite-gap information property

$$ I(X_K; X^- \mid X^K) < \infty $$

for some $K$, then

$$ \lim_{n \to \infty} I(X_n; X^- \mid X^n) = 0 $$

and

$$ \lim_{n \to \infty} \frac{1}{n} I(X^n; X^-) = 0. $$

The corollary can be interpreted as saying that if a process has the finite-gap
information property, then the mutual information between a single sample
and the infinite past conditioned on the intervening samples goes to zero as the
number of intervening samples goes to infinity. This can be interpreted as a
form of asymptotic independence property of the process.
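The second limit of Corollary 6.4.1 rests on the Cesàro mean fact quoted above: if the individual terms converge, so do their running averages, and to the same limit. The following is a minimal sketch of that fact only. The sequence `terms` is a hypothetical stand-in for the conditional informations $I(X_i; X^- \mid X^i)$, chosen simply because it tends to zero; it is not computed from any process.

```python
def cesaro_means(terms):
    """Running arithmetic means (1/n) * sum_{i<n} terms[i]."""
    means, running = [], 0.0
    for n, t in enumerate(terms, start=1):
        running += t
        means.append(running / n)
    return means


# Hypothetical stand-in for I(X_i; X^- | X^i): any sequence tending to 0.
terms = [1.0 / (i + 1) for i in range(100_000)]
means = cesaro_means(terms)

# The means lag behind the terms but still tend to the same limit (0 here).
print("last term:", terms[-1])
print("last Cesaro mean:", means[-1])
```

The running mean decays more slowly than the terms themselves, which is why a bound on the per-sample quantity is the stronger of the two statements.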

Corollary 6.4.2: If a one-sided stationary source $\{X_n\}$ is such that for some
$K$, $I(X_n; X^{n-K} \mid X_{n-K}^K)$ is bounded uniformly in $n$, then it has the finite-gap
property and hence

$$ \bar{I}(X; Y) = I^*(X; Y). $$

Proof: Simply imbed the one-sided source into a two-sided stationary source
with the same probabilities on all finite-dimensional events. For that source

$$ I(X_n; X^{n-K} \mid X_{n-K}^K) = I(X_K; X_{-1}, \cdots, X_{-(n-K)} \mid X^K) \mathop{\longrightarrow}_{n \to \infty} I(X_K; X^- \mid X^K). $$

Thus if the terms are bounded, the conditions of Theorem 6.4.2 are met for the
two-sided source. The one-sided equality then follows. □

The above results have an information theoretic implication for the ergodic 
decomposition, which is described in the next theorem. 

Theorem 6.4.4: Suppose that $\{X_n\}$ is a stationary process with the finite-gap
property (6.13). Let $\psi$ be the ergodic component function of Theorem 1.8.3
and suppose that for some $n$

$$ I(X^n; \psi) < \infty. \tag{6.17} $$

(This will be the case, for example, if the finite-gap information property holds
for 0 gap, that is, $I(X_0; X^-) < \infty$, since $\psi$ can be determined from $X^-$ and
information is decreased by taking a function.) Then

$$ \lim_{n \to \infty} \frac{1}{n} I(X^n; \psi) = 0. $$

Comment: For discrete alphabet processes this theorem is just the ergodic
decomposition of entropy rate in disguise (Theorem 2.4.1). It also follows for
finite alphabet processes from Lemma 3.3.1. We shall later prove a corresponding
almost everywhere convergence result for the corresponding densities. All
of these results have the interpretation that the per-symbol mutual information
between the outputs of the process and the ergodic component decreases with
time because the ergodic component in effect can be inferred from the process
output in the limit of an infinite observation sequence. The finiteness condition
on some $I(X^n; \psi)$ is necessary for the nonzero finite-gap case to avoid cases such
as where $X_n = \psi$ for all $n$ and hence

$$ I(X^n; \psi) = I(\psi; \psi) = H(\psi) = \infty, $$

in which case the theorem does not hold.

Proof: Define $\psi_n = \psi$ for all $n$. Since $\psi$ is invariant, $\{X_n, \psi_n\}$ is a stationary
process. Since $X_n$ satisfies the given conditions, however, $\bar{I}(X; \psi) = I^*(X; \psi)$.
But for any scalar quantizer $q$, $\bar{I}(q(X); \psi)$ is 0 from Lemma 3.3.1. $I^*(X; \psi)$ is
therefore 0 since it is the supremum of $\bar{I}(q(X); \psi)$ over all quantizers $q$. Thus

$$ 0 = \bar{I}(X; \psi) = \lim_{n \to \infty} \frac{1}{n} I(X^n; \psi^n) = \lim_{n \to \infty} \frac{1}{n} I(X^n; \psi). \quad \Box $$




Chapter 7 



Relative Entropy Rates 



7.1 Introduction 

This chapter extends many of the basic properties of relative entropy to se- 
quences of random variables and to processes. Several limiting properties of 
entropy rates are proved and a mean ergodic theorem for relative entropy densi- 
ties is given. The principal ergodic theorems for relative entropy and information 
densities in the general case are given in the next chapter. 



7.2 Relative Entropy Densities and Rates 

Suppose that $p$ and $m$ are two AMS distributions for a random process $\{X_n\}$
with a standard alphabet $A$. For convenience we assume that the random variables
$\{X_n\}$ are coordinate functions of an underlying measurable space $(\Omega, \mathcal{B})$
where $\Omega$ is a one-sided or two-sided sequence space and $\mathcal{B}$ is the corresponding
$\sigma$-field. Thus $x \in \Omega$ has the form $x = \{x_i\}$, where the index $i$ runs from
0 to $\infty$ for a one-sided process and from $-\infty$ to $+\infty$ for a two-sided process.
The random variables and vectors of principal interest are $X_n(x) = x_n$,
$X^n(x) = x^n = (x_0, \cdots, x_{n-1})$, and $X_l^k(x) = (x_l, \cdots, x_{l+k-1})$. The process
distributions $p$ and $m$ are both probability measures on the measurable space
$(\Omega, \mathcal{B})$.

For $n = 1, 2, \cdots$ let $M_{X^n}$ and $P_{X^n}$ be the vector distributions induced by $m$
and $p$, respectively. We assume throughout this section that $M_{X^n} \gg P_{X^n}$ and hence that
the Radon-Nikodym derivatives $f_{X^n} = dP_{X^n}/dM_{X^n}$ and the entropy densities
$h_{X^n} = \ln f_{X^n}$ are well defined for all $n = 1, 2, \cdots$. Strictly speaking, for each $n$
the random variable $f_{X^n}$ is defined on the measurable space $(A^n, \mathcal{B}_{A^n})$ and hence
$f_{X^n}$ is defined on a different space for each $n$. When considering convergence
of relative entropy densities, it is necessary to consider a sequence of random
variables defined on a common measurable space, and hence two notational
modifications are introduced: The random variables $f_{X^n}(X^n) : \Omega \to [0, \infty)$ are
defined by

$$ f_{X^n}(X^n)(x) = f_{X^n}(X^n(x)) = f_{X^n}(x^n) $$

for $n = 1, 2, \cdots$. Similarly the entropy densities can be defined on the common
space $(\Omega, \mathcal{B})$ by

$$ h_{X^n}(X^n) = \ln f_{X^n}(X^n). $$

The reader is warned of the potentially confusing dual use of $X^n$ in this notation:
the subscript is the name of the random variable $X^n$ and the argument
is the random variable $X^n$ itself. To simplify notation somewhat, we will often
abbreviate the previous (unconditional) densities to

$$ f_n = f_{X^n}(X^n); \quad h_n = h_{X^n}(X^n). $$

For $n = 1, 2, \cdots$ define the relative entropy by

$$ H_{p\|m}(X^n) = D(P_{X^n} \| M_{X^n}) = E_{P_{X^n}} h_{X^n} = E_p h_{X^n}(X^n). $$

Define the relative entropy rate by

$$ \bar{H}_{p\|m}(X) = \limsup_{n \to \infty} \frac{1}{n} H_{p\|m}(X^n). $$

Analogous to Dobrushin's definition of information rate, we also define

$$ H^*_{p\|m}(X) = \sup_q \bar{H}_{p\|m}(q(X)), $$

where the supremum is over all scalar quantizers $q$.

Define as in Chapter 5 the conditional densities

$$ f_{X_n \mid X^n} = \frac{f_{X^{n+1}}}{f_{X^n}} = \frac{dP_{X^{n+1}}/dM_{X^{n+1}}}{dP_{X^n}/dM_{X^n}} = \frac{dP_{X_n \mid X^n}}{dM_{X_n \mid X^n}} \tag{7.1} $$

provided $f_{X^n} \ne 0$ and $f_{X_n \mid X^n} = 1$ otherwise. As for unconditional densities we
change the notation when we wish to emphasize that the densities can all be
defined on a common underlying sequence space. For example, we follow the
notation for ordinary conditional probability density functions and define the
random variables

$$ f_{X_n \mid X^n}(X_n \mid X^n) = \frac{f_{X^{n+1}}(X^{n+1})}{f_{X^n}(X^n)} $$

and

$$ h_{X_n \mid X^n}(X_n \mid X^n) = \ln f_{X_n \mid X^n}(X_n \mid X^n) $$

on $(\Omega, \mathcal{B})$. These densities will not have a simple abbreviation as do the
unconditional densities.

Define the conditional relative entropy

$$ H_{p\|m}(X_n \mid X^n) = E_{P_{X^{n+1}}}(\ln f_{X_n \mid X^n}) = \int dp \, \ln f_{X_n \mid X^n}(X_n \mid X^n). \tag{7.2} $$

All of the above definitions are immediate applications of definitions of Chapter
5 to the random variables $X_n$ and $X^n$. The difference is that these are now
defined for all samples of a random process, that is, for all $n = 1, 2, \cdots$. The
focus of this chapter is on the interrelations of these entropy measures and on some
of their limiting properties for large $n$.

For convenience define

$$ D_n = H_{p\|m}(X_n \mid X^n); \quad n = 1, 2, \cdots, $$

and $D_0 = H_{p\|m}(X_0)$. From Theorem 5.3.1 this quantity is nonnegative and

$$ D_n + D(P_{X^n} \| M_{X^n}) = D(P_{X^{n+1}} \| M_{X^{n+1}}). $$

If $D(P_{X^n} \| M_{X^n}) < \infty$, then also

$$ D_n = D(P_{X^{n+1}} \| M_{X^{n+1}}) - D(P_{X^n} \| M_{X^n}). $$

We can write $D_n$ as a single divergence if we define as in Theorem 5.3.1 the
distribution $S_{X^{n+1}}$ by

$$ S_{X^{n+1}}(F \times G) = \int_G M_{X_n \mid X^n}(F \mid x^n) \, dP_{X^n}(x^n); \quad F \in \mathcal{B}_A, \; G \in \mathcal{B}_{A^n}. \tag{7.3} $$

Recall that $S_{X^{n+1}}$ combines the distribution $P_{X^n}$ on $X^n$ with the conditional
distribution $M_{X_n \mid X^n}$ giving the conditional probability under $M$ for $X_n$ given
$X^n$. We shall abbreviate this construction by

$$ S_{X^{n+1}} = M_{X_n \mid X^n} P_{X^n}. \tag{7.4} $$

Then

$$ D_n = D(P_{X^{n+1}} \| S_{X^{n+1}}). \tag{7.5} $$

Note that $S_{X^{n+1}}$ is not in general a consistent family of measures in the sense
of the Kolmogorov extension theorem since its form changes with $n$, the first
$n$ samples being chosen according to $p$ and the final sample being chosen using
the conditional distribution induced by $m$ given the first $n$ samples. Thus,
in particular, we cannot infer that there is a process distribution $s$ which has
$S_{X^n}$, $n = 1, 2, \cdots$ as its vector distributions.
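For a finite alphabet the identities above can be checked by direct enumeration. The sketch below is illustrative only: it assumes a binary alphabet and two arbitrary strictly positive pmfs `p` and `m` on sequences of length `N` (both hypothetical, drawn at random), builds $S_{X^{n+1}}$ as in (7.4), and compares $D_n$ computed from (7.5) with the difference of successive divergences.

```python
import itertools
import math
import random


def random_pmf(n, rng):
    """Strictly positive random pmf on binary sequences of length n."""
    seqs = list(itertools.product((0, 1), repeat=n))
    weights = [rng.random() + 0.1 for _ in seqs]
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def marginal(pmf, n):
    """Marginal pmf of the first n coordinates (x_0, ..., x_{n-1})."""
    out = {}
    for s, pr in pmf.items():
        out[s[:n]] = out.get(s[:n], 0.0) + pr
    return out


def divergence(p, m):
    """D(p||m) in nats; assumes m(s) > 0 wherever p(s) > 0."""
    return sum(pr * math.log(pr / m[s]) for s, pr in p.items() if pr > 0)


def s_distribution(p, m, n):
    """S_{X^{n+1}} = M_{X_n|X^n} P_{X^n}, following (7.4)."""
    p_n, m_n, m_next = marginal(p, n), marginal(m, n), marginal(m, n + 1)
    return {s: p_n[s[:n]] * m_next[s] / m_n[s[:n]] for s in m_next}


rng = random.Random(0)
N = 4
p, m = random_pmf(N, rng), random_pmf(N, rng)

checks = []
for n in range(N):
    d_next = divergence(marginal(p, n + 1), marginal(m, n + 1))
    d_prev = divergence(marginal(p, n), marginal(m, n))
    d_n = divergence(marginal(p, n + 1), s_distribution(p, m, n))  # (7.5)
    checks.append((n, d_n, d_next - d_prev))
    print(f"n={n}: D_n={d_n:.6f}  D(n+1)-D(n)={d_next - d_prev:.6f}")
```

Summing the printed $D_n$ over $n = 0, \cdots, N-1$ recovers $D(P_{X^N} \| M_{X^N})$, since the differences telescope.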

We immediately have a chain rule for densities

$$ f_{X^n} = \prod_{i=0}^{n-1} f_{X_i \mid X^i} \tag{7.6} $$

and a corresponding chain rule for conditional relative entropies similar to that
for ordinary entropies:

$$ D(P_{X^n} \| M_{X^n}) = H_{p\|m}(X^n) = \sum_{i=0}^{n-1} H_{p\|m}(X_i \mid X^i) = \sum_{i=0}^{n-1} D_i. \tag{7.7} $$



7.3 Markov Dominating Measures



The evaluation of relative entropy simplifies for certain special cases and
reduces to a mutual information when the dominating measure is a Markov
approximation of the dominated measure. The following theorem is an extension to
sequences of the results of Corollary 5.5.2 and Lemma 5.5.4.

Theorem 7.3.1: Suppose that $p$ is a process distribution for a standard
alphabet random process $\{X_n\}$ with induced vector distributions $P_{X^n}$; $n = 1, 2, \cdots$.
Suppose also that there exists a process distribution $m$ with induced
vector distributions $M_{X^n}$ such that

(a) under $m$ $\{X_n\}$ is a $k$-step Markov source, that is, for all $n \ge k$,
$X^{n-k} \to X_{n-k}^k \to X_n$ is a Markov chain or, equivalently,

$$ M_{X_n \mid X^n} = M_{X_n \mid X_{n-k}^k}, $$

and

(b) $M_{X^n} \gg P_{X^n}$, $n = 1, 2, \cdots$ so that the densities

$$ f_{X^n} = \frac{dP_{X^n}}{dM_{X^n}} $$

are well defined.

Suppose also that $p^{(k)}$ is the $k$-step Markov approximation to $p$, that is, the
source with induced vector distributions $P^{(k)}_{X^n}$ such that

$$ P^{(k)}_{X^k} = P_{X^k} $$

and for all $n \ge k$

$$ P^{(k)}_{X_n \mid X_{n-k}^k} = P_{X_n \mid X_{n-k}^k}; $$

that is, $p^{(k)}$ is a $k$-step Markov process having the same initial distribution and
the same $k$th order conditional probabilities as $p$. Then for all $n \ge k$

$$ M_{X^n} \gg P^{(k)}_{X^n} \gg P_{X^n} \tag{7.8} $$

and

$$ \frac{dP^{(k)}_{X^n}}{dM_{X^n}} = f^{(k)}_{X^n} \equiv f_{X^k} \prod_{l=k}^{n-1} f_{X_l \mid X_{l-k}^k}, \tag{7.9} $$

$$ \frac{dP_{X^n}}{dP^{(k)}_{X^n}} = \frac{f_{X^n}}{f^{(k)}_{X^n}}. \tag{7.10} $$

Furthermore

$$ h_{X_n \mid X^n} = h_{X_n \mid X_{n-k}^k} + i_{X_n; X^{n-k} \mid X_{n-k}^k} \tag{7.11} $$

and hence

$$ D_n = H_{p\|m}(X_n \mid X^n) = I_p(X_n; X^{n-k} \mid X_{n-k}^k) + H_{p\|m}(X_n \mid X_{n-k}^k). \tag{7.12} $$

Thus

$$ h_{X^n} = h_{X^k} + \sum_{l=k}^{n-1} \left( h_{X_l \mid X_{l-k}^k} + i_{X_l; X^{l-k} \mid X_{l-k}^k} \right) \tag{7.13} $$

and hence

$$ D(P_{X^n} \| M_{X^n}) = H_{p\|m}(X^k) + \sum_{l=k}^{n-1} \left( I_p(X_l; X^{l-k} \mid X_{l-k}^k) + H_{p\|m}(X_l \mid X_{l-k}^k) \right). \tag{7.14} $$

If $m = p^{(k)}$, then for all $n \ge k$ we have that $h_{X_n \mid X_{n-k}^k} = 0$ and hence

$$ H_{p\|p^{(k)}}(X_n \mid X_{n-k}^k) = 0 \tag{7.15} $$

and

$$ D_n = I_p(X_n; X^{n-k} \mid X_{n-k}^k), \tag{7.16} $$

and hence

$$ D(P_{X^n} \| P^{(k)}_{X^n}) = \sum_{l=k}^{n-1} I_p(X_l; X^{l-k} \mid X_{l-k}^k). \tag{7.17} $$


Proof: If $n = k + 1$, then the results follow from Corollary 5.3.3 and Lemma
5.5.4 with $X = X_n$, $Z = X^k$, and $Y = X_k$. Now proceed by induction and
assume that the results hold for $n$. Consider the distribution $Q_{X^{n+1}}$ specified
by $Q_{X^n} = P_{X^n}$ and $Q_{X_n \mid X^n} = P_{X_n \mid X_{n-k}^k}$. In other words,

$$ Q_{X^{n+1}} = P_{X_n \mid X_{n-k}^k} P_{X^n}. $$

Application of Corollary 5.3.1 with $Z = X^{n-k}$, $Y = X_{n-k}^k$, and $X = X_n$ implies
that $M_{X^{n+1}} \gg Q_{X^{n+1}} \gg P_{X^{n+1}}$ and that

$$ \frac{dP_{X^{n+1}}}{dQ_{X^{n+1}}} = \frac{f_{X_n \mid X^n}}{f_{X_n \mid X_{n-k}^k}}. $$

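This means that we can write

$$ P_{X^{n+1}}(F) = \int_F \frac{dP_{X^{n+1}}}{dQ_{X^{n+1}}} \, dQ_{X^{n+1}} = \int_F \frac{dP_{X^{n+1}}}{dQ_{X^{n+1}}} \, dQ_{X_n \mid X^n} \, dQ_{X^n} $$
$$ = \int_F \frac{dP_{X^{n+1}}}{dQ_{X^{n+1}}} \, dP_{X_n \mid X_{n-k}^k} \, dP_{X^n}. $$

From the induction hypothesis we can express this as

$$ P_{X^{n+1}}(F) = \int_F \frac{dP_{X^{n+1}}}{dQ_{X^{n+1}}} \frac{dP_{X^n}}{dP^{(k)}_{X^n}} \, dP_{X_n \mid X_{n-k}^k} \, dP^{(k)}_{X^n} $$
$$ = \int_F \frac{dP_{X^{n+1}}}{dQ_{X^{n+1}}} \frac{dP_{X^n}}{dP^{(k)}_{X^n}} \, dP^{(k)}_{X^{n+1}}, $$

proving that $P^{(k)}_{X^{n+1}} \gg P_{X^{n+1}}$ and that

$$ \frac{dP_{X^{n+1}}}{dP^{(k)}_{X^{n+1}}} = \frac{dP_{X^{n+1}}}{dQ_{X^{n+1}}} \frac{dP_{X^n}}{dP^{(k)}_{X^n}} = \frac{f_{X_n \mid X^n}}{f_{X_n \mid X_{n-k}^k}} \frac{dP_{X^n}}{dP^{(k)}_{X^n}}. $$

This proves the right hand part of (7.8) and (7.10).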
Next define the distribution $\hat{P}_{X^n}$ by

$$ \hat{P}_{X^n}(F) = \int_F f^{(k)}_{X^n} \, dM_{X^n}, $$

where $f^{(k)}_{X^n}$ is defined in (7.9). Proving that $\hat{P}_{X^n} = P^{(k)}_{X^n}$ will prove both the left
hand relation of (7.8) and (7.9). Clearly

$$ \frac{d\hat{P}_{X^n}}{dM_{X^n}} = f^{(k)}_{X^n} $$

and from the definition of $f^{(k)}$ and conditional densities

$$ f^{(k)}_{X_n \mid X^n} = f^{(k)}_{X_n \mid X_{n-k}^k}. \tag{7.18} $$

From Corollary 5.3.1 it follows that $X^{n-k} \to X_{n-k}^k \to X_n$ is a Markov
chain. Since this is true for any $n \ge k$, $\hat{P}_{X^n}$ is the distribution of a $k$-step
Markov process. By construction we also have that

$$ f^{(k)}_{X_n \mid X_{n-k}^k} = f_{X_n \mid X_{n-k}^k} \tag{7.19} $$

and hence from Theorem 5.3.1

$$ \hat{P}_{X_n \mid X_{n-k}^k} = P_{X_n \mid X_{n-k}^k}. $$

Since also $f^{(k)}_{X^k} = f_{X^k}$, $\hat{P}_{X^n} = P^{(k)}_{X^n}$ as claimed. This completes the proof of
(7.8)-(7.10). Eq. (7.11) follows since

$$ f_{X_n \mid X^n} = f_{X_n \mid X_{n-k}^k} \times \frac{f_{X_n \mid X^n}}{f_{X_n \mid X_{n-k}^k}}. $$

Eq. (7.12) then follows by taking expectations. Eq. (7.13) follows from (7.11)
and

$$ f_{X^n} = f_{X^k} \prod_{l=k}^{n-1} f_{X_l \mid X^l}, $$

whence (7.14) follows by taking expectations. If $m = p^{(k)}$, then the claims
follow from (5.27)-(5.28). □

Corollary 7.3.1: Given a stationary source $p$, suppose that for some $K$
there exists a $K$-step Markov source $m$ with distributions $M_{X^n} \gg P_{X^n}$, $n = 1, 2, \cdots$.
Then for all $k \ge K$ (7.8)-(7.10) hold.

Proof: If $m$ is a $K$-step Markov source with the property $M_{X^n} \gg P_{X^n}$,
$n = 1, 2, \cdots$, then it is also a $k$-step Markov source with this property for all
$k \ge K$. The corollary then follows from the theorem. □

Comment: The corollary implies that if any $K$-step Markov source dominates
$p$ on its finite dimensional distributions, then for all $k \ge K$ the $k$-step
Markov approximations $p^{(k)}$ also dominate $p$ on its finite dimensional distributions.
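For finite alphabets the Markov approximation of Theorem 7.3.1 and the identity (7.17) can be verified by brute force. The following is a minimal sketch under stated assumptions: a binary alphabet, block length `n = 5`, order `k = 2`, and a randomly drawn strictly positive pmf `p` (all hypothetical choices). The helper names are illustrative and do not come from the text.

```python
import itertools
import math
import random


def window_marginal(pmf, start, length):
    """pmf of the window (x_start, ..., x_{start+length-1})."""
    out = {}
    for s, pr in pmf.items():
        key = s[start:start + length]
        out[key] = out.get(key, 0.0) + pr
    return out


def markov_approximation(pmf, n, k):
    """P^(k): same k-dimensional initial pmf and k-th order conditionals as pmf."""
    init = window_marginal(pmf, 0, k)
    joint = {l: window_marginal(pmf, l - k, k + 1) for l in range(k, n)}
    past = {l: window_marginal(pmf, l - k, k) for l in range(k, n)}
    approx = {}
    for s in pmf:
        pr = init[s[:k]]
        for l in range(k, n):
            pr *= joint[l][s[l - k:l + 1]] / past[l][s[l - k:l]]
        approx[s] = pr
    return approx


def cond_mutual_info(pmf, l, k):
    """I_p(X_l; X^{l-k} | X^k_{l-k}) in nats."""
    full = window_marginal(pmf, 0, l + 1)
    past_all = window_marginal(pmf, 0, l)
    joint = window_marginal(pmf, l - k, k + 1)
    past_k = window_marginal(pmf, l - k, k)
    return sum(
        pr * math.log(pr * past_k[s[l - k:l]]
                      / (past_all[s[:l]] * joint[s[l - k:l + 1]]))
        for s, pr in full.items() if pr > 0
    )


rng = random.Random(1)
n, k = 5, 2
seqs = list(itertools.product((0, 1), repeat=n))
weights = [rng.random() + 0.1 for _ in seqs]
p = {s: w / sum(weights) for s, w in zip(seqs, weights)}

p_k = markov_approximation(p, n, k)
divergence = sum(pr * math.log(pr / p_k[s]) for s, pr in p.items())
info_sum = sum(cond_mutual_info(p, l, k) for l in range(k, n))
print("D(P || P^(k))              =", divergence)
print("sum of conditional infos   =", info_sum)
```

The term for $l = k$ is zero because $X^{l-k}$ is then empty, so only the samples beyond the first $k + 1$ contribute to the divergence.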

The following variational corollary follows from Theorem 7.3.1.

Corollary 7.3.2: For a fixed $k$ let $\mathcal{M}_k$ denote the set of all $k$-step
Markov distributions. Then $\inf_{M \in \mathcal{M}_k} D(P_{X^n} \| M)$ is attained by $P^{(k)}_{X^n}$, and

$$ \inf_{M \in \mathcal{M}_k} D(P_{X^n} \| M) = D(P_{X^n} \| P^{(k)}_{X^n}) = \sum_{l=k}^{n-1} I_p(X_l; X^{l-k} \mid X_{l-k}^k). $$

Since the divergence can be thought of as a distance between probability
distributions, the corollary justifies considering the $k$-step Markov process with
the same $k$th order distributions as the $k$-step Markov approximation or model
for the original process: It is the minimum divergence distribution meeting the
$k$-step Markov requirement.
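The minimizing property in Corollary 7.3.2 can be probed numerically. The sketch below is an illustration, not a proof: it assumes a binary alphabet, block length 3, and $k = 1$, draws a hypothetical positive pmf `p`, and compares $D(P \| P^{(1)})$ against the divergence from 500 randomly generated first-order Markov pmfs (with time-varying transitions, which the corollary permits).

```python
import itertools
import math
import random

N = 3  # block length; k = 1 (first-order Markov) throughout this sketch
SEQS = list(itertools.product((0, 1), repeat=N))


def normalize(ws):
    total = sum(ws)
    return [w / total for w in ws]


def random_markov(rng):
    """A random first-order Markov pmf on SEQS with time-varying transitions."""
    init = normalize([rng.random() + 0.05 for _ in range(2)])
    trans = [[normalize([rng.random() + 0.05 for _ in range(2)])
              for _ in range(2)] for _ in range(N - 1)]
    return {s: init[s[0]]
            * math.prod(trans[l][s[l]][s[l + 1]] for l in range(N - 1))
            for s in SEQS}


def markov_approximation(p):
    """P^(1): pmf of X_0 and conditionals P(x_{l+1} | x_l) taken from p."""
    def marg(idx):
        out = {}
        for s, pr in p.items():
            key = tuple(s[i] for i in idx)
            out[key] = out.get(key, 0.0) + pr
        return out

    single = [marg((l,)) for l in range(N)]
    pair = [marg((l, l + 1)) for l in range(N - 1)]
    return {s: single[0][(s[0],)]
            * math.prod(pair[l][(s[l], s[l + 1])] / single[l][(s[l],)]
                        for l in range(N - 1))
            for s in SEQS}


def divergence(p, m):
    """D(p||m) in nats for strictly positive m."""
    return sum(pr * math.log(pr / m[s]) for s, pr in p.items() if pr > 0)


rng = random.Random(2)
p = dict(zip(SEQS, normalize([rng.random() + 0.1 for _ in SEQS])))
best = divergence(p, markov_approximation(p))
others = [divergence(p, random_markov(rng)) for _ in range(500)]
print("D(P || P^(1))                 =", best)
print("smallest over random Markov M =", min(others))
```

No randomly drawn Markov pmf should fall below the value attained by the Markov approximation; random search only illustrates the bound and will generally not come close to it.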



7.4 Stationary Processes 



Several of the previous results simplify when the processes $m$ and $p$ are both
stationary. We can consider the processes to be two-sided since given a stationary
one-sided process, there is always a stationary two-sided process with the same
probabilities on all positive time events. When both processes are stationary,
the densities $f_{X_m^n}$ and $f_{X^n}$ satisfy

$$ f_{X_m^n} = \frac{dP_{X_m^n}}{dM_{X_m^n}} = f_{X^n} T^m = \frac{dP_{X^n}}{dM_{X^n}} T^m $$

and have the same expectation for any integer $m$. Similarly the conditional
densities $f_{X_n \mid X^n}$, $f_{X_k \mid X_{k-n}^n}$, and $f_{X_0 \mid X_{-1}, X_{-2}, \cdots, X_{-n}}$ satisfy

$$ f_{X_n \mid X^n} = f_{X_k \mid X_{k-n}^n} T^{n-k} = f_{X_0 \mid X_{-1}, X_{-2}, \cdots, X_{-n}} T^n \tag{7.20} $$

for any $k$ and have the same expectation. Thus

$$ \frac{1}{n} H_{p\|m}(X^n) = \frac{1}{n} \sum_{i=0}^{n-1} H_{p\|m}(X_0 \mid X_{-1}, \cdots, X_{-i}). \tag{7.21} $$

Using the construction of Theorem 5.3.1 we have also that

$$ D_i = H_{p\|m}(X_i \mid X^i) = H_{p\|m}(X_0 \mid X_{-1}, \cdots, X_{-i}) $$
$$ = D(P_{X_0, X_{-1}, \cdots, X_{-i}} \| S_{X_0, X_{-1}, \cdots, X_{-i}}), $$

where now

$$ S_{X_0, X_{-1}, \cdots, X_{-i}} = M_{X_0 \mid X_{-1}, \cdots, X_{-i}} P_{X_{-1}, \cdots, X_{-i}}; \tag{7.22} $$

that is,

$$ S_{X_0, X_{-1}, \cdots, X_{-i}}(F \times G) = \int_G M_{X_0 \mid X_{-1}, \cdots, X_{-i}}(F \mid x^i) \, dP_{X_{-1}, \cdots, X_{-i}}(x^i); \quad F \in \mathcal{B}_A, \; G \in \mathcal{B}_{A^i}. $$

As before the $S_{X^n}$ distributions are not in general consistent. For example,
they can yield differing marginal distributions $S_{X_0}$. As we saw in the finite
case, general conclusions about the behavior of the limiting conditional relative
entropies cannot be drawn for arbitrary reference measures. If, however, we
assume as in the finite case that the reference measures are Markov, then we
can proceed.

Suppose now that under $m$ the process is a $k$-step Markov process. Then for
any $n \ge k$ $(X_{-n}, \cdots, X_{-k-2}, X_{-k-1}) \to X_{-k}^k \to X_0$ is a Markov chain under $m$
and Lemma 5.5.4 implies that

$$ H_{p\|m}(X_0 \mid X_{-1}, \cdots, X_{-n}) = H_{p\|m}(X_k \mid X^k) + I_p(X_k; (X_{-1}, \cdots, X_{-n}) \mid X^k) \tag{7.23} $$

and hence from (7.21)

$$ \bar{H}_{p\|m}(X) = H_{p\|m}(X_k \mid X^k) + I_p(X_k; X^- \mid X^k). \tag{7.24} $$

We also have, however, that $X^- \to X^k \to X_k$ is a Markov chain under $m$
and hence a second application of Lemma 5.5.4 implies that

$$ H_{p\|m}(X_0 \mid X^-) = H_{p\|m}(X_k \mid X^k) + I_p(X_k; X^- \mid X^k). \tag{7.25} $$

Putting these facts together and using (7.2) yields the following lemma.

Lemma 7.4.1: Let $\{X_n\}$ be a two-sided process with a standard alphabet
and let $p$ and $m$ be stationary process distributions such that $M_{X^n} \gg P_{X^n}$ for all
$n$ and $m$ is $k$th order Markov. Then the relative entropy rate exists and

$$ \bar{H}_{p\|m}(X) = \lim_{n \to \infty} \frac{1}{n} H_{p\|m}(X^n) $$
$$ = \lim_{n \to \infty} H_{p\|m}(X_0 \mid X_{-1}, \cdots, X_{-n}) = H_{p\|m}(X_0 \mid X^-) $$
$$ = H_{p\|m}(X_k \mid X^k) + I_p(X_k; X^- \mid X^k) $$
$$ = E_p[\ln f_{X_k \mid X^k}(X_k \mid X^k)] + I_p(X_k; X^- \mid X^k). \tag{7.26} $$
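As a concrete finite-alphabet instance of (7.26), take both $p$ and $m$ to be stationary first-order Markov chains on two states. Then $I_p(X_1; X^- \mid X_0) = 0$ and the rate reduces to $E_p \ln f_{X_1 \mid X_0}$. The sketch below assumes hypothetical transition matrices `P` and `Q` (rows are the current state) and compares this closed form with $n^{-1} D(P_{X^n} \| M_{X^n})$ obtained by enumeration.

```python
import itertools
import math


def stationary_2state(T):
    """Stationary pmf of a two-state chain with transition matrix T."""
    a, b = T[0][1], T[1][0]
    return [b / (a + b), a / (a + b)]


def path_prob(seq, init, T):
    """Probability of a state sequence under initial pmf init and matrix T."""
    pr = init[seq[0]]
    for u, v in zip(seq, seq[1:]):
        pr *= T[u][v]
    return pr


def block_divergence(n, P, Q):
    """D(P_{X^n} || M_{X^n}) in nats, both chains started in stationarity."""
    pi_p, pi_q = stationary_2state(P), stationary_2state(Q)
    total = 0.0
    for seq in itertools.product((0, 1), repeat=n):
        pp = path_prob(seq, pi_p, P)
        total += pp * math.log(pp / path_prob(seq, pi_q, Q))
    return total


def markov_rate(P, Q):
    """E_p ln f_{X_1|X_0}: the relative entropy rate when p is itself Markov."""
    pi_p = stationary_2state(P)
    return sum(pi_p[a] * P[a][b] * math.log(P[a][b] / Q[a][b])
               for a in (0, 1) for b in (0, 1))


# Hypothetical transition matrices, all entries positive so that M >> P.
P = [[0.9, 0.1], [0.3, 0.7]]
Q = [[0.6, 0.4], [0.5, 0.5]]

rate = markov_rate(P, Q)
for n in (1, 2, 4, 8, 12):
    print(n, block_divergence(n, P, Q) / n, rate)
```

In this special case the increments $D_n$ equal the rate exactly for every $n \ge 1$, so the normalized divergence differs from the rate only through the initial term, which is averaged away as $n$ grows.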



Corollary 7.4.1: Given the assumptions of Lemma 7.4.1,

$$ H_{p\|m}(X^N \mid X^-) = N H_{p\|m}(X_0 \mid X^-). $$

Proof: From the chain rule for conditional relative entropy (equation (7.7)),

$$ H_{p\|m}(X^N \mid X^-) = \sum_{l=0}^{N-1} H_{p\|m}(X_l \mid X^l, X^-). $$

Stationarity implies that each term in the sum equals $H_{p\|m}(X_0 \mid X^-)$, proving
the corollary. □

The next corollary extends Corollary 7.3.2 to processes.

Corollary 7.4.2: Given $k$ and $n \ge k$, let $\mathcal{M}_k$ denote the class of all $k$-step
stationary Markov process distributions. Then

$$ \inf_{m \in \mathcal{M}_k} \bar{H}_{p\|m}(X) = \bar{H}_{p\|p^{(k)}}(X) = I_p(X_k; X^- \mid X^k). $$

Proof: Follows from (7.23) and Theorem 7.3.1. □

This result gives an interpretation of the finite-gap information property
(6.13): If a process has this property, then there exists a $k$-step Markov process
which is only a finite "distance" from the given process in terms of limiting
per-symbol divergence. If any such process has a finite distance, then the
$k$-step Markov approximation also has a finite distance. Furthermore, we can
apply Corollary 6.4.1 to obtain the generalization of the finite alphabet result
of Theorem 2.6.2.

Corollary 7.4.3: Given a stationary process distribution $p$ which satisfies
the finite-gap information property,

$$ \inf_k \inf_{m \in \mathcal{M}_k} \bar{H}_{p\|m}(X) = \inf_k \bar{H}_{p\|p^{(k)}}(X) = \lim_{k \to \infty} \bar{H}_{p\|p^{(k)}}(X) = 0. $$

Lemma 7.4.1 also yields the following approximation lemma.

Corollary 7.4.4: Given a process $\{X_n\}$ with standard alphabet $A$ let $p$
and $m$ be stationary measures such that $P_{X^n} \ll M_{X^n}$ for all $n$ and $m$ is $k$th
order Markov. Let $q_k$ be an asymptotically accurate sequence of quantizers for
$A$. Then

$$ \bar{H}_{p\|m}(X) = \lim_{k \to \infty} \bar{H}_{p\|m}(q_k(X)), $$

that is, the divergence rate can be approximated arbitrarily closely by that of
a quantized version of the process. Thus, in particular,

$$ \bar{H}_{p\|m}(X) = H^*_{p\|m}(X). $$

Proof: This follows from Corollary 5.2.3 by letting the generating $\sigma$-fields
be $\mathcal{F}_n = \sigma(q_n(X_i); i = 0, -1, \cdots)$ and the representation of conditional relative
entropy as an ordinary divergence. □
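The quantizer approximation in Corollary 7.4.4 is easy to see in the simplest case of memoryless processes, where the rate is just the per-sample divergence. The sketch below makes that simplifying assumption: under $p$ the samples are i.i.d. $N(1, 1)$, under $m$ they are i.i.d. $N(0, 1)$, so the unquantized rate is $1/2$ nat. The uniform quantizer on $[-8, 9]$ with two overflow cells, and the function names, are hypothetical choices made for illustration.

```python
import math


def cell_prob(a, b, mean):
    """P(a < X <= b) for X ~ N(mean, 1), computed stably in both tails."""
    lo, hi = a - mean, b - mean
    if lo >= 0:  # upper tail: difference of survival functions
        return 0.5 * (math.erfc(lo / math.sqrt(2)) - math.erfc(hi / math.sqrt(2)))
    # otherwise: difference of distribution functions, Phi(x) = erfc(-x/sqrt 2)/2
    return 0.5 * (math.erfc(-hi / math.sqrt(2)) - math.erfc(-lo / math.sqrt(2)))


def quantized_divergence(levels, low=-8.0, high=9.0, mean_p=1.0, mean_m=0.0):
    """D(q(P)||q(M)) for a uniform quantizer with `levels` cells on [low, high]
    plus two overflow cells."""
    width = (high - low) / levels
    edges = [-math.inf] + [low + i * width for i in range(levels + 1)] + [math.inf]
    total = 0.0
    for a, b in zip(edges, edges[1:]):
        pc, mc = cell_prob(a, b, mean_p), cell_prob(a, b, mean_m)
        if pc > 0:
            total += pc * math.log(pc / mc)
    return total


# Dyadic refinements: each quantizer refines the previous one.
values = [quantized_divergence(2 ** j) for j in range(1, 11)]
for j, v in enumerate(values, start=1):
    print(f"{2 ** j:5d} cells: {v:.6f}")
print("unquantized divergence: 0.5")
```

Because each quantizer refines the previous one, the quantized divergences are nondecreasing, and they approach the unquantized value from below, in line with the definition of $H^*$ as a supremum over quantizers.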

Another interesting property of relative entropy rates for stationary pro- 
cesses is that we can “reverse time” when computing the rate in the sense of 
the following lemma. 

Lemma 7.4.2: Let $\{X_n\}$, $p$, and $m$ be as in Lemma 7.4.1. If either
$\bar{H}_{p\|m}(X) < \infty$ or $H_{p\|m}(X_0 \mid X^-) < \infty$, then

$$ H_{p\|m}(X_0 \mid X_{-1}, \cdots, X_{-n}) = H_{p\|m}(X_0 \mid X_1, \cdots, X_n) $$

and hence

$$ H_{p\|m}(X_0 \mid X_1, X_2, \cdots) = H_{p\|m}(X_0 \mid X_{-1}, X_{-2}, \cdots) = \bar{H}_{p\|m}(X) < \infty. $$

Proof: If $\bar{H}_{p\|m}(X)$ is finite, then so must be the terms $H_{p\|m}(X^n) = D(P_{X^n} \| M_{X^n})$
(since otherwise all such terms with larger $n$ would also be infinite and hence
$\bar{H}$ could not be finite). Thus from stationarity

$$ H_{p\|m}(X_0 \mid X_{-1}, \cdots, X_{-n}) = H_{p\|m}(X_n \mid X^n) $$
$$ = D(P_{X^{n+1}} \| M_{X^{n+1}}) - D(P_{X^n} \| M_{X^n}) $$
$$ = D(P_{X^{n+1}} \| M_{X^{n+1}}) - D(P_{X_1^n} \| M_{X_1^n}) = H_{p\|m}(X_0 \mid X_1, \cdots, X_n), $$

from which the results follow. If on the other hand the conditional relative
entropy is finite, the results then follow as in the proof of Lemma 7.4.1 using the
fact that the joint relative entropies are arithmetic averages of the conditional
relative entropies and that the conditional relative entropy is defined as the
divergence between the $P$ and $S$ measures (Theorem 5.3.2). □



7.5 Mean Ergodic Theorems 

In this section we state and prove some preliminary ergodic theorems for relative
entropy densities analogous to those first developed for entropy densities in
Chapter 3 and for information densities in Section 6.3. In particular, we show
that an almost everywhere ergodic theorem for finite alphabet processes follows
easily from the sample entropy ergodic theorem and that an approximation
argument then yields an $L^1$ ergodic theorem for stationary sources. The results
involve little new and closely parallel those for mutual information densities
and therefore the details are skimpy. The results are given for completeness and
because the $L^1$ results yield the byproduct that relative entropies are uniformly
integrable, a fact which does not follow as easily for relative entropies as it did
for entropies.

Finite Alphabets

Suppose that we now have two process distributions $p$ and $m$ for a random
process $\{X_n\}$ with finite alphabet. Let $P_{X^n}$ and $M_{X^n}$ denote the induced
$n$th order distributions and $p_{X^n}$ and $m_{X^n}$ the corresponding probability mass
functions (pmf's). For example, $p_{X^n}(a^n) = P_{X^n}(\{x^n : x^n = a^n\}) = p(\{x : X^n(x) = a^n\})$.
We assume that $P_{X^n} \ll M_{X^n}$. In this case the relative
entropy density is given simply by

$$ h_n(x) = h_{X^n}(X^n)(x) = \ln \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)}, $$

where $x^n = X^n(x)$.

The following lemma generalizes Theorem 3.1.1 from entropy densities to 
relative entropy densities for finite alphabet processes. Relative entropies are of 
more general interest than ordinary entropies because they generalize to contin- 
uous alphabets in a useful way while ordinary entropies do not. 

Lemma 7.5.1: Suppose that $\{X_n\}$ is a finite alphabet process and that $p$
and $m$ are two process distributions with $M_{X^n} \gg P_{X^n}$ for all $n$, where $p$ is
AMS with stationary mean $\bar{p}$, $m$ is a $k$th order Markov source with stationary
transitions, and $\{\bar{p}_x\}$ is the ergodic decomposition of the stationary mean of $p$.
Assume also that $M_{X^n} \gg \bar{P}_{X^n}$ for all $n$. Then

$$ \lim_{n \to \infty} \frac{1}{n} h_n = h; \quad p\text{-a.e. and in } L^1(p), $$

where $h(x)$ is the invariant function defined by

$$ h(x) = -\bar{H}_{\bar{p}_x}(X) - E_{\bar{p}_x} \ln m(X_k \mid X^k) $$
$$ = \lim_{n \to \infty} \frac{1}{n} H_{\bar{p}_x \| m}(X^n) = \bar{H}_{\bar{p}_x \| m}(X), \tag{7.27} $$

where

$$ m(X_k \mid X^k)(x) \equiv \frac{m_{X^{k+1}}(x^{k+1})}{m_{X^k}(x^k)} = M_{X_k \mid X^k}(x_k \mid x^k). $$

Furthermore,

$$ E_p h = \bar{H}_{p\|m}(X) = \lim_{n \to \infty} \frac{1}{n} H_{p\|m}(X^n), \tag{7.28} $$

that is, the relative entropy rate of an AMS process with respect to a Markov
process with stationary transitions is given by the limit. Lastly,

$$ \bar{H}_{p\|m}(X) = \bar{H}_{\bar{p}\|m}(X); \tag{7.29} $$

that is, the relative entropy rate of the AMS process with respect to $m$ is the
same as that of its stationary mean with respect to $m$.

Proof: We have that

$$ \frac{1}{n} h(X^n) = \frac{1}{n} \ln p(X^n) - \frac{1}{n} \ln m(X^k) - \frac{1}{n} \sum_{i=k}^{n-1} \ln m(X_i \mid X_{i-k}^k) $$
$$ = \frac{1}{n} \ln p(X^n) - \frac{1}{n} \ln m(X^k) - \frac{1}{n} \sum_{i=k}^{n-1} \ln m(X_k \mid X^k) T^{i-k}, \tag{7.30} $$

where $T$ is the shift transformation, $p(X^n)$ is an abbreviation for $p_{X^n}(X^n)$, and
$m(X_k \mid X^k) = M_{X_k \mid X^k}(X_k \mid X^k)$. From Theorem 3.1.1 the first term converges to
$-\bar{H}_{\bar{p}_x}(X)$ $p$-a.e. and in $L^1(p)$.

Since $M_{X^k} \gg P_{X^k}$, if $M_{X^k}(F) = 0$, then also $P_{X^k}(F) = 0$. Thus $P_{X^k}$ and
hence also $p$ assign zero probability to the event that $M_{X^k}(X^k) = 0$. Thus with
probability one under $p$, $\ln m(X^k)$ is finite and hence the second term in (7.30)
converges to 0 $p$-a.e. as $n \to \infty$.

Define $\alpha$ as the minimum nonzero value of the conditional probability $m(x_k \mid x^k)$.
Then with probability 1 under $M_{X^n}$ and hence also under $P_{X^n}$ we have that

$$ \frac{1}{n} \sum_{i=k}^{n-1} \ln \frac{1}{m(X_i \mid X_{i-k}^k)} \le \ln \frac{1}{\alpha} $$

since otherwise the sequence $X^n$ would have 0 probability under $M_{X^n}$ and hence
also under $P_{X^n}$ and $0 \ln 0$ is considered to be 0. Thus the rightmost term of
(7.30) is uniformly integrable with respect to $p$ and hence from Theorem 1.8.3
this term converges to $E_{\bar{p}_x}(\ln m(X_k \mid X^k))$. This proves the leftmost equality of
(7.27).

Let $\bar{P}_{X^n \mid x}$ denote the distribution of $X^n$ under the ergodic component $\bar{p}_x$.
Since $M_{X^n} \gg \bar{P}_{X^n}$ and $\bar{P}_{X^n} = \int d\bar{p}(x) \bar{P}_{X^n \mid x}$, if $M_{X^n}(F) = 0$, then $\bar{P}_{X^n \mid x}(F) = 0$
$\bar{p}$-a.e. Since the alphabet of $X_n$ is finite, we therefore also have with probability
one under $\bar{p}$ that $M_{X^n} \gg \bar{P}_{X^n \mid x}$ and hence

$$ H_{\bar{p}_x \| m}(X^n) = \sum_{a^n} \bar{P}_{X^n \mid x}(a^n) \ln \frac{\bar{P}_{X^n \mid x}(a^n)}{M_{X^n}(a^n)} $$

is well defined for $\bar{p}$-almost all $x$. This expectation can also be written as

$$ H_{\bar{p}_x \| m}(X^n) = -H_{\bar{p}_x}(X^n) - E_{\bar{p}_x}\left[ \ln m(X^k) + \sum_{i=k}^{n-1} \ln m(X_k \mid X^k) T^{i-k} \right] $$
$$ = -H_{\bar{p}_x}(X^n) - E_{\bar{p}_x}[\ln m(X^k)] - (n - k) E_{\bar{p}_x}[\ln m(X_k \mid X^k)], $$

where we have used the stationarity of the ergodic components. Dividing by
$n$ and taking the limit as $n \to \infty$, the middle term goes to zero as previously
and the remaining limits prove the middle equality and hence the rightmost
equality in (7.27).

Equation (7.28) follows from (7.27) and $L^1(p)$ convergence, that is, since
$n^{-1} h_n \to h$, we must also have that $E_p(n^{-1} h_n(X^n)) = n^{-1} H_{p\|m}(X^n)$ converges
to $E_p h$. Since the former limit is $\bar{H}_{p\|m}(X)$, (7.28) follows. Since $\bar{p}_x$ is invariant
(Theorem 1.8.2) and since expectations of invariant functions are the same under
an AMS measure and its stationary mean (Lemma 6.3.1 of [50]), application of
the previous results of the lemma to both $p$ and $\bar{p}$ proves that

$$ \bar{H}_{p\|m}(X) = \int dp(x) \bar{H}_{\bar{p}_x \| m}(X) = \int d\bar{p}(x) \bar{H}_{\bar{p}_x \| m}(X) = \bar{H}_{\bar{p}\|m}(X), $$

which proves (7.29) and completes the proof of the lemma. □

Corollary 7.5.1: Given $p$ and $m$ as in the Lemma, then the relative entropy
rate of $p$ with respect to $m$ has an ergodic decomposition, that is,

$$ \bar{H}_{p\|m}(X) = \int dp(x) \bar{H}_{\bar{p}_x \| m}(X). $$

Proof: This follows immediately from (7.27) and (7.28). □
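The almost everywhere convergence in Lemma 7.5.1 can be observed on a simulated sample path. The sketch below assumes the simplest setting covered by the lemma: $p$ is a stationary ergodic two-state Markov chain (so the ergodic decomposition is trivial and $h$ is constant) and $m$ is a first-order Markov source with stationary transitions. The matrices `P` and `Q`, the initial pmf `init_q`, the path length, and the seed are all hypothetical choices.

```python
import math
import random


def simulate_chain(T, init, n, rng):
    """Sample x_0, ..., x_{n-1} from a two-state chain."""
    x = 0 if rng.random() < init[0] else 1
    path = [x]
    for _ in range(n - 1):
        x = 0 if rng.random() < T[x][0] else 1
        path.append(x)
    return path


def sample_relative_entropy(path, P, init_p, Q, init_q):
    """(1/n) h_n = (1/n) ln [p(x^n) / m(x^n)] along one sample path."""
    log_ratio = math.log(init_p[path[0]] / init_q[path[0]])
    for u, v in zip(path, path[1:]):
        log_ratio += math.log(P[u][v] / Q[u][v])
    return log_ratio / len(path)


P = [[0.9, 0.1], [0.3, 0.7]]   # transitions under p
Q = [[0.6, 0.4], [0.5, 0.5]]   # transitions under m
pi_p = [0.75, 0.25]            # stationary pmf of P
init_q = [0.5, 0.5]            # arbitrary positive initial pmf for m

# Relative entropy rate: E_p ln [P(X_1|X_0) / Q(X_1|X_0)] in nats.
rate = sum(pi_p[a] * P[a][b] * math.log(P[a][b] / Q[a][b])
           for a in (0, 1) for b in (0, 1))

rng = random.Random(3)
path = simulate_chain(P, pi_p, 100_000, rng)
estimate = sample_relative_entropy(path, P, pi_p, Q, init_q)
print("sample (1/n) h_n:", estimate)
print("relative entropy rate:", rate)
```

The initial pmf of $m$ affects only the first term of $h_n$, which is why the lemma needs stationary transitions for $m$ but no assumption that $m$ itself is stationary.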



Standard Alphabets 

We now drop the finite alphabet assumption and suppose that $\{X_n\}$ is a standard
alphabet process with process distributions $p$ and $m$, where $p$ is stationary,
$m$ is $k$th order Markov with stationary transitions, and $M_{X^n} \gg P_{X^n}$ are the
induced vector distributions for $n = 1, 2, \cdots$. Define the densities $f_n$ and entropy
densities $h_n$ as previously.

As an easy consequence of the development to this point, the ergodic de- 
composition for divergence rate of finite alphabet processes combined with the 
definition of H* as a supremum over rates of quantized processes yields an ex- 
tension of Corollary 6.2.1 to divergences. This yields other useful properties as 
summarized in the following corollary. 

Corollary 7.5.1: Given a standard alphabet process $\{X_n\}$ suppose that $p$
and $m$ are two process distributions such that $p$ is AMS and $m$ is $k$th order
Markov with stationary transitions and $M_{X^n} \gg P_{X^n}$ are the induced vector
distributions. Let $\bar{p}$ denote the stationary mean of $p$ and let $\{\bar{p}_x\}$ denote the
ergodic decomposition of the stationary mean $\bar{p}$. Then

$$ H^*_{p\|m}(X) = \int dp(x) H^*_{\bar{p}_x \| m}(X). \tag{7.31} $$

In addition,

$$ H^*_{p\|m}(X) = H^*_{\bar{p}\|m}(X) = \bar{H}_{\bar{p}\|m}(X) = \bar{H}_{p\|m}(X); \tag{7.32} $$

that is, the two definitions of relative entropy rate yield the same values for
AMS $p$ and stationary transition Markov $m$ and both rates are the same as the
corresponding rates for the stationary mean. Thus relative entropy rate has an
ergodic decomposition in the sense that

$$ \bar{H}_{p\|m}(X) = \int dp(x) \bar{H}_{\bar{p}_x \| m}(X). \tag{7.33} $$

Comment: Note that the extra technical conditions of Theorem 6.4.2 for
equality of the analogous mutual information rates $\bar{I}$ and $I^*$ are not needed
here. Note also that only the ergodic decomposition of the stationary mean $\bar{p}$
of the AMS measure $p$ is considered and not that of the Markov source $m$.

Proof: The first statement follows as previously described from the finite
alphabet result and the definition of $H^*$. The left-most and right-most equalities
of (7.32) both follow from the previous lemma. The middle equality of (7.32)
follows from Corollary 7.4.2. Eq. (7.33) then follows from (7.31) and (7.32). □

Theorem 7.5.1: Given a standard alphabet process $\{X_n\}$ suppose that $p$
and $m$ are two process distributions such that $p$ is AMS and $m$ is $k$th order
Markov with stationary transitions and $M_{X^n} \gg P_{X^n}$ are the induced vector
distributions. Let $\{\bar{p}_x\}$ denote the ergodic decomposition of the stationary mean
$\bar{p}$. If

$$ \lim_{n \to \infty} \frac{1}{n} H_{p\|m}(X^n) = \bar{H}_{p\|m}(X) < \infty, $$

then there is an invariant function $h$ such that $n^{-1} h_n \to h$ in $L^1(p)$ as $n \to \infty$.
In fact,

$$ h(x) = \bar{H}_{\bar{p}_x \| m}(X), $$

the relative entropy rate of the ergodic component $\bar{p}_x$ with respect to $m$. Thus,
in particular, under the stated conditions the relative entropy densities $h_n$ are
uniformly integrable with respect to $p$.

Proof: The proof exactly parallels that of Theorem 6.3.1, the mean ergodic 
theorem for information densities, with the relative entropy densities replacing 
the mutual information densities. The density is approximated by that of a 
quantized version and the integral bounded above using the triangle inequality. 
One term goes to zero from the finite alphabet case. Since H = H* (Corollary 
7.5.1) the remaining terms go to zero because the relative entropy rate can be 
approximated arbitrarily closely by that of a quantized process. □ 

It should be emphasized that although Theorem 7.5.1 and Theorem 6.3.1
are similar in appearance, neither result directly implies the other. It is true
that mutual information can be considered as a special case of relative entropy,
but given a pair process $\{X_n, Y_n\}$ we cannot in general find a $k$th order Markov
distribution $m$ for which the mutual information rate $\bar{I}(X; Y)$ equals a relative
entropy rate $\bar{H}_{p\|m}$. We will later consider conditions under which convergence
of relative entropy densities does imply convergence of information densities.




Chapter 8 



Ergodic Theorems for 
Densities 

8.1 Introduction 

This chapter is devoted to developing ergodic theorems first for relative entropy
densities and then information densities for the general case of AMS processes
with standard alphabets. The general results were first developed by Barron [9]
using the martingale convergence theorem and a new martingale inequality. The
similar results of Algoet and Cover [7] can be proved without direct recourse to
martingale theory. They infer the result for the stationary Markov approximation
and for the infinite order approximation from the ordinary ergodic theorem.
They then demonstrate that the growth rate of the true density is asymptotically
sandwiched between that for the $k$th order Markov approximation and the
infinite order approximation and that no gap is left between these asymptotic
upper and lower bounds in the limit as $k \to \infty$. They use martingale theory
to show that the values between which the limiting density is sandwiched are
arbitrarily close to each other, but we shall see that this is not necessary and
this property follows from the results of Chapter 6.



8.2 Stationary Ergodic Sources 

Theorem 8.2.1: Given a standard alphabet process $\{X_n\}$, suppose that $p$ and 
$m$ are two process distributions such that $p$ is stationary ergodic and $m$ is a 
$K$-step Markov source with stationary transition probabilities. Let $M_{X^n} \gg P_{X^n}$ 
be the vector distributions induced by $m$ and $p$. As before let 

\[
h_n = \ln f_{X^n}(X^n) = \ln \frac{dP_{X^n}}{dM_{X^n}}(X^n).
\]

Then with probability one under $p$ 

\[
\lim_{n\to\infty} \frac{1}{n} h_n = \bar{H}_{p\|m}(X).
\]

Proof: Let $p^{(k)}$ denote the $k$-step Markov approximation of $p$ as defined in 
Theorem 7.3.1, that is, $p^{(k)}$ has the same $k$th order conditional probabilities 
and $k$-dimensional initial distribution. From Corollary 7.3.1, if $k \ge K$, then 
(7.8)-(7.10) hold. Consider the expectation 



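\[
E_p\left(\frac{f^{(k)}_{X^n}(X^n)}{f_{X^n}(X^n)}\right)
  = \int \frac{f^{(k)}_{X^n}}{f_{X^n}}\, dP_{X^n}.
\]

Define the set $A_n = \{x^n : f_{X^n} > 0\}$; then $P_{X^n}(A_n) = 1$. Use the fact that 
$f_{X^n} = dP_{X^n}/dM_{X^n}$ to write 

\[
E_p\left(\frac{f^{(k)}_{X^n}(X^n)}{f_{X^n}(X^n)}\right)
  = \int_{A_n} \frac{f^{(k)}_{X^n}}{f_{X^n}}\, f_{X^n}\, dM_{X^n}
  = \int_{A_n} f^{(k)}_{X^n}\, dM_{X^n}.
\]

From Corollary 7.3.1, 

\[
f^{(k)}_{X^n} = \frac{dP^{(k)}_{X^n}}{dM_{X^n}}
\]

and therefore 

\[
E_p\left(\frac{f^{(k)}_{X^n}(X^n)}{f_{X^n}(X^n)}\right)
  = \int_{A_n} \frac{dP^{(k)}_{X^n}}{dM_{X^n}}\, dM_{X^n}
  = P^{(k)}_{X^n}(A_n) \le 1.
\]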



Thus we can apply Lemma 5.4.2 to the sequence $f^{(k)}_{X^n}(X^n)/f_{X^n}(X^n)$ to 
conclude that with $p$-probability 1 

\[
\limsup_{n\to\infty} \frac{1}{n} \ln \frac{f^{(k)}_{X^n}(X^n)}{f_{X^n}(X^n)} \le 0
\]

and hence 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f^{(k)}_{X^n}(X^n)
  \le \liminf_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n). \tag{8.1}
\]

The left-hand limit is well defined by the usual ergodic theorem: 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f^{(k)}_{X^n}(X^n)
  = \lim_{n\to\infty} \frac{1}{n} \sum_{l=k}^{n-1} \ln f_{X_l|X^k_{l-k}}(X_l|X^k_{l-k})
  + \lim_{n\to\infty} \frac{1}{n} \ln f_{X^k}(X^k).
\]



Since $0 < f_{X^k} < \infty$ with probability 1 under $M_{X^k}$ and hence also under $P_{X^k}$, 
then $0 < f_{X^k}(X^k) < \infty$ under $p$ and therefore $n^{-1} \ln f_{X^k}(X^k) \to 0$ as $n \to \infty$ 
with probability one. Furthermore, from the ergodic theorem for stationary and 
ergodic processes (e.g., Theorem 7.2.1 of [50]), since $p$ is stationary ergodic we 
have with probability one under $p$ using (7.20) and Corollary 7.4.1 that 

\[
\begin{aligned}
\lim_{n\to\infty} \frac{1}{n} \sum_{l=k}^{n-1} \ln f_{X_l|X^k_{l-k}}(X_l|X^k_{l-k})
 &= \lim_{n\to\infty} \frac{1}{n} \sum_{l=k}^{n-1}
    \ln f_{X_0|X_{-1},\ldots,X_{-k}}(X_0|X_{-1},\ldots,X_{-k})\, T^l \\
 &= E_p \ln f_{X_0|X_{-1},\ldots,X_{-k}}(X_0|X_{-1},\ldots,X_{-k}) \\
 &= H_{p\|m}(X_0|X_{-1},\ldots,X_{-k}) = \bar{H}_{p^{(k)}\|m}(X).
\end{aligned}
\]

Thus with (8.1) we now have that 

\[
\liminf_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n) \ge H_{p\|m}(X_0|X_{-1},\ldots,X_{-k}) \tag{8.2}
\]

for any positive integer $k$. Since $m$ is $K$th order Markov, Lemma 7.4.1 and the 
above imply that 

\[
\liminf_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n) \ge H_{p\|m}(X_0|X^-) = \bar{H}_{p\|m}(X), \tag{8.3}
\]

which completes half of the sandwich proof of the theorem. 

If $\bar{H}_{p\|m}(X) = \infty$, the proof is completed with (8.3). Hence we can suppose 
that $\bar{H}_{p\|m}(X) < \infty$. From Lemma 7.4.1 using the distribution $S_{X_0,X_{-1},X_{-2},\ldots}$ 
constructed there, we have that 

\[
D(P_{X_0,X_{-1},\ldots} \| S_{X_0,X_{-1},\ldots}) = H_{p\|m}(X_0|X^-)
  = \int dP_{X_0,X^-} \ln f_{X_0|X^-},
\]

where 

\[
f_{X_0|X^-} = \frac{dP_{X_0,X_{-1},\ldots}}{dS_{X_0,X_{-1},\ldots}}.
\]

It should be pointed out that we have not proved (and will not prove) that 
$f_{X_0|X_{-1},\ldots,X_{-n}} \to f_{X_0|X^-}$, the convergence of conditional probability densities 
which follows from the martingale convergence theorem and is the result around which 
most generalized Shannon-McMillan-Breiman theorems are built. (See, e.g., Barron [9].) 

We have proved, however, that the expectations converge (Lemma 7.4.1), which 
is what is needed to make the sandwich argument work. 

For the second half of the sandwich proof we construct a measure $Q$ which 
will be dominated by $p$ on semi-infinite sequences using the above conditional 
densities given the infinite past. Define the semi-infinite sequence $X_n^- = \{\ldots, X_{n-1}\}$ 
for all nonnegative integers $n$. Let $\mathcal{B}_k^n = \sigma(X_k^n)$ and $\mathcal{B}_k^- = \sigma(X_k^-) = \sigma(\ldots, X_{k-1})$ 
be the $\sigma$-fields generated by the finite dimensional random vector $X_k^n$ and the 
semi-infinite sequence $X_k^-$, respectively. Let $Q$ be the process distribution 
having the same restriction to $\sigma(X_k^-)$ as does $p$ and the same restriction to 
$\sigma(X_0, X_1, \ldots)$ as does $p$, but which makes $X^-$ and $X_k^n$ conditionally independent 
given $X^k$ for any $n$; that is, 

\[
Q_{X_k^-} = P_{X_k^-},
\]

\[
Q_{X_k,X_{k+1},\ldots} = P_{X_k,X_{k+1},\ldots},
\]

and $X^- \to X^k \to X_k^n$ is a Markov chain for all positive integers $n$ so that 

\[
Q(X_k^n \in F \mid X_k^-) = Q(X_k^n \in F \mid X^k).
\]

The measure $Q$ is a (nonstationary) $k$-step Markov approximation to $P$ in 
the sense of Section 5.3 and 

\[
Q = P_{X^- \times (X_k,X_{k+1},\ldots) \mid X^k}
\]

(in contrast to $P = P_{X^- X^k X_k^\infty}$). Observe that $X^- \to X^k \to X_k^n$ is a Markov 
chain under both $Q$ and $m$. 

By assumption, 

\[
H_{p\|m}(X_0|X^-) < \infty
\]

and hence from Corollary 7.4.1 

\[
H_{p\|m}(X_k^n|X_k^-) = n H_{p\|m}(X_0|X^-) < \infty
\]

and hence from Theorem 5.3.2 the density $f_{X_k^n|X_k^-}$ is well-defined as 

\[
f_{X_k^n|X_k^-} = \frac{dP_{X_{n+k}^-}}{dS_{X_{n+k}^-}} \tag{8.4}
\]

where 

\[
S_{X_{n+k}^-} = M_{X_k^n|X^k} P_{X_k^-}, \tag{8.5}
\]

and 

\[
\int dP_{X_{n+k}^-} \ln f_{X_k^n|X_k^-} = D(P_{X_{n+k}^-} \| S_{X_{n+k}^-})
  = n H_{p\|m}(X_0|X^-) < \infty. \tag{8.6}
\]

Thus, in particular, 

\[
S_{X_{n+k}^-} \gg P_{X_{n+k}^-}.
\]

Consider now the sequence of ratios of conditional densities 

\[
\zeta_n = \frac{f_{X_k^n|X^k}(X^{n+k})}{f_{X_k^n|X_k^-}(X_{n+k}^-)}.
\]

We have that 

\[
\int dp\, \zeta_n = \int_{G_n} dp\, \zeta_n,
\]

where 

\[
G_n = \{x : f_{X_k^n|X_k^-}(x_{n+k}^-) > 0\},
\]

since $G_n$ has probability 1 under $p$ (or else (8.6) would be violated). Thus 



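\[
\begin{aligned}
\int dp\, \zeta_n
 &= \int dP_{X_{n+k}^-}\,
    \frac{f_{X_k^n|X^k}(X^{n+k})}{f_{X_k^n|X_k^-}}\,
    1_{\{f_{X_k^n|X_k^-} > 0\}} \\
 &= \int dS_{X_{n+k}^-}\, f_{X_k^n|X_k^-}\,
    \frac{f_{X_k^n|X^k}(X^{n+k})}{f_{X_k^n|X_k^-}}\,
    1_{\{f_{X_k^n|X_k^-} > 0\}} \\
 &= \int dS_{X_{n+k}^-}\, f_{X_k^n|X^k}(X^{n+k})\, 1_{\{f_{X_k^n|X_k^-} > 0\}}
  \le \int dS_{X_{n+k}^-}\, f_{X_k^n|X^k}(X^{n+k}).
\end{aligned}
\]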

Using the definition of the measure $S$ and iterated expectation we have that 

\[
\begin{aligned}
\int dp\, \zeta_n
 &\le \int dM_{X_k^n|X_k^-}\, dP_{X_k^-}\, f_{X_k^n|X^k}(X^{n+k}) \\
 &= \int dM_{X_k^n|X^k}\, dP_{X_k^-}\, f_{X_k^n|X^k}(X^{n+k}).
\end{aligned}
\]

Since the integrand is now measurable with respect to $\sigma(X^{n+k})$, this reduces 
to 

\[
\int dp\, \zeta_n \le \int dM_{X_k^n|X^k}\, dP_{X^k}\, f_{X_k^n|X^k}.
\]



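Applying Lemma 5.3.2 we have 

\[
\int dp\, \zeta_n \le \int dM_{X_k^n|X^k}\, dP_{X^k}\, \frac{dP_{X_k^n|X^k}}{dM_{X_k^n|X^k}}
  = \int dP_{X^k}\, dP_{X_k^n|X^k} = 1.
\]

Thus 

\[
\int dp\, \zeta_n \le 1
\]

and we can apply Lemma 5.4.1 to conclude that $p$-a.e. 

\[
\limsup_{n\to\infty} \frac{1}{n} \ln \zeta_n
  = \limsup_{n\to\infty} \frac{1}{n} \ln \frac{f_{X_k^n|X^k}}{f_{X_k^n|X_k^-}} \le 0. \tag{8.7}
\]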



Using the chain rule for densities, 

\[
\frac{f_{X_k^n|X^k}}{f_{X_k^n|X_k^-}}
  = \frac{f_{X^n}}{f_{X^k}} \times \frac{1}{\prod_{l=k}^{n-1} f_{X_l|X_l^-}}.
\]

Thus from (8.7) 

\[
\limsup_{n\to\infty} \left( \frac{1}{n} \ln f_{X^n} - \frac{1}{n} \ln f_{X^k}
  - \frac{1}{n} \sum_{l=k}^{n-1} \ln f_{X_l|X_l^-} \right) \le 0.
\]



Invoking the ergodic theorem for the rightmost terms and the fact that 
the middle term converges to 0 almost everywhere since $\ln f_{X^k}$ is finite almost 
everywhere implies that 

\[
\limsup_{n\to\infty} \frac{1}{n} \ln f_{X^n} \le E_p(\ln f_{X_k|X_k^-}) = E_p(\ln f_{X_0|X^-})
  = \bar{H}_{p\|m}(X). \tag{8.8}
\]

Combining this with (8.3) completes the sandwich and proves the theorem. 

□ 
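
For a finite alphabet the convergence asserted by Theorem 8.2.1 can be checked numerically. The following Python sketch is an illustration only and is not part of the original development. It assumes a two-state stationary ergodic Markov chain for $p$ (the hypothetical matrix `P`) and a memoryless uniform reference measure $m$ (the hypothetical matrix `M`), and it compares the sample relative entropy density $n^{-1} \ln f_{X^n}(X^n)$ with the relative entropy rate, which in this special case reduces to $\sum_a \pi(a) \sum_b P(a,b) \ln [P(a,b)/M(a,b)]$, where $\pi$ is the stationary distribution of `P`.

```python
import math
import random

# Hypothetical example parameters (not taken from the text).
P = [[0.9, 0.1], [0.3, 0.7]]   # transition matrix of the source p
M = [[0.5, 0.5], [0.5, 0.5]]   # transition matrix of the reference measure m
M_INIT = [0.5, 0.5]            # initial distribution under m


def stationary(T):
    """Stationary distribution of a two-state chain with transition matrix T."""
    a, b = T[0][1], T[1][0]
    return [b / (a + b), a / (a + b)]


def sample_path(T, init, n, rng):
    """Draw X_0, ..., X_{n-1} from the chain with initial law init and matrix T."""
    x = rng.choices([0, 1], weights=init)[0]
    path = [x]
    for _ in range(n - 1):
        x = rng.choices([0, 1], weights=T[x])[0]
        path.append(x)
    return path


def log_prob(T, init, path):
    """Natural log of the probability of `path` under the chain (init, T)."""
    total = math.log(init[path[0]])
    for a, b in zip(path, path[1:]):
        total += math.log(T[a][b])
    return total


def sample_density(path):
    """(1/n) ln f_{X^n}(x^n), where f_{X^n} = dP_{X^n}/dM_{X^n}."""
    pi = stationary(P)
    return (log_prob(P, pi, path) - log_prob(M, M_INIT, path)) / len(path)


def relative_entropy_rate():
    """Relative entropy rate of p with respect to m for these two chains."""
    pi = stationary(P)
    return sum(pi[a] * P[a][b] * math.log(P[a][b] / M[a][b])
               for a in range(2) for b in range(2))


if __name__ == "__main__":
    rng = random.Random(0)
    path = sample_path(P, stationary(P), 20000, rng)
    print(sample_density(path), relative_entropy_rate())
```

The two printed numbers should be close for long paths; the agreement is only statistical, so the size of the discrepancy depends on the path length and the seed.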



8.3 Stationary Nonergodic Sources 

Next suppose that the source $p$ is stationary with ergodic decomposition $\{p_\lambda;\ \lambda \in \Lambda\}$ 
and ergodic component function $\psi$ as in Theorem 1.8.3. We first require some 
technical details to ensure that the various Radon-Nikodym derivatives are well 
defined and that the needed chain rules for densities hold. 

Lemma 8.3.1: Given a stationary source $\{X_n\}$, let $\{p_\lambda;\ \lambda \in \Lambda\}$ denote 
the ergodic decomposition and $\psi$ the ergodic component function of Theorem 
1.8.3. Let $P_\psi$ denote the induced distribution of $\psi$. Let $P_{X^n}$ and $P^\lambda_{X^n}$ denote 
the induced marginal distributions of $p$ and $p_\lambda$. Assume that $\{X_n\}$ has the 
finite-gap information property of (6.13); that is, there exists a $K$ such that 

\[
I_p(X_K; X^- \mid X^K) < \infty, \tag{8.9}
\]

where $X^- = (X_{-1}, X_{-2}, \ldots)$. We also assume that for some $n$ 

\[
I(X^n; \psi) < \infty. \tag{8.10}
\]

This will be the case, for example, if (8.9) holds for $K = 0$. Let $m$ be a $K$-step 
Markov process such that $M_{X^n} \gg P_{X^n}$ for all $n$. (Observe that such 
a process exists since from (8.9) the $K$th order Markov approximation $p^{(K)}$ 
suffices.) Define $M_{X^n,\psi} = M_{X^n} \times P_\psi$. Then 

\[
M_{X^n,\psi} \gg P_{X^n} \times P_\psi \gg P_{X^n,\psi}, \tag{8.11}
\]

and with probability 1 under $p$ 

\[
M_{X^n} \gg P_{X^n} \gg P^\psi_{X^n}.
\]

Lastly, 

\[
\frac{dP^\psi_{X^n}}{dM_{X^n}} = f_{X^n|\psi} = \frac{dP_{X^n,\psi}}{d(M_{X^n} \times P_\psi)} \tag{8.12}
\]

and therefore 

\[
\frac{dP^\psi_{X^n}}{dP_{X^n}} = \frac{dP^\psi_{X^n}/dM_{X^n}}{dP_{X^n}/dM_{X^n}}
  = \frac{f_{X^n|\psi}}{f_{X^n}}. \tag{8.13}
\]

Proof: From Theorem 6.4.4 the given assumptions ensure that 

\[
\lim_{n\to\infty} \frac{1}{n} E_p\, i(X^n; \psi) = \lim_{n\to\infty} \frac{1}{n} I(X^n; \psi) = 0 \tag{8.14}
\]

and hence $P_{X^n} \times P_\psi \gg P_{X^n,\psi}$ (since otherwise $I(X^n; \psi)$ would be infinite for 
some $n$ and hence infinite for all larger $n$ since it is increasing with $n$). This 
proves the right-most absolute continuity relation of (8.11). This in turn implies 
that $M_{X^n} \times P_\psi \gg P_{X^n,\psi}$. The lemma then follows from Theorem 5.3.1 with 
$X = X^n$, $Y = \psi$ and the chain rule for Radon-Nikodym derivatives. □ 

We know that the source will produce with probability one an ergodic component 
$p_\lambda$ and hence Theorem 8.2.1 will hold for this ergodic component. In 
other words, we have for all $\lambda$ that 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{X^n|\psi}(X^n|\lambda) = \bar{H}_{p_\lambda\|m}(X); \quad p_\lambda\text{-a.e.}
\]

This implies that 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{X^n|\psi}(X^n|\psi) = \bar{H}_{p_\psi\|m}(X); \quad p\text{-a.e.} \tag{8.15}
\]

Making this step precise generalizes Lemma 3.3.1. 

Lemma 8.3.2: Suppose that $\{X_n\}$ is a stationary not necessarily ergodic 
source with ergodic component function $\psi$. Then (8.15) holds. 

Proof: The proof parallels that for Lemma 3.3.1. Observe that if we have 
two random variables $U, V$ ($U = X_0, X_1, \ldots$ and $V = \psi$ above) and a sequence 
of functions $g_n(U,V)$ (here $n^{-1} \ln f_{X^n|\psi}(X^n|\psi)$) and a function $g(V)$ (here $\bar{H}_{p_\psi\|m}(X)$) with 
the property 

\[
\lim_{n\to\infty} g_n(U,v) = g(v), \quad P_{U|V=v}\text{-a.e.},
\]

then also 

\[
\lim_{n\to\infty} g_n(U,V) = g(V); \quad P_{UV}\text{-a.e.}
\]

since defining the (measurable) set $G = \{u,v : \lim_{n\to\infty} g_n(u,v) = g(v)\}$ and its 
section $G_v = \{u : (u,v) \in G\}$, then from (1.26) 

\[
P_{UV}(G) = \int P_{U|V}(G_v|v)\, dP_V(v) = 1
\]

if $P_{U|V}(G_v|v) = 1$ with probability 1. □ 

It is not, however, the relative entropy density using the distribution of the 
ergodic component that we wish to show converges. It is the original sample 
density $f_{X^n}$. The following theorem shows that the two sample entropies converge 
to the same thing. The theorem generalizes Lemma 3.3.1 and is proved by a 
sandwich argument analogous to Theorem 8.2.1. The result can be viewed as 
an almost everywhere version of (8.14). 

Theorem 8.3.1: Given a stationary source $\{X_n\}$, let $\{p_\lambda;\ \lambda \in \Lambda\}$ denote 
the ergodic decomposition and $\psi$ the ergodic component function of Theorem 
1.8.3. Assume that the finite-gap information property (8.9) is satisfied and 
that (8.10) holds for some $n$. Then 

\[
\lim_{n\to\infty} \frac{1}{n} i(X^n; \psi)
  = \lim_{n\to\infty} \frac{1}{n} \ln \frac{f_{X^n|\psi}}{f_{X^n}} = 0; \quad p\text{-a.e.}
\]

Proof: From Theorem 5.4.1 we have immediately that 

\[
\liminf_{n\to\infty} \frac{1}{n} i_n(X^n; \psi) \ge 0, \tag{8.16}
\]

which provides half of the sandwich proof. 

To develop the other half of the sandwich, for each $k \ge K$ let $p^{(k)}$ denote the 
$k$-step Markov approximation of $p$. Exactly as in the proof of Theorem 8.2.1, 
it follows that (8.1) holds. Now, however, the Markov approximation relative 
entropy density converges instead as 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f^{(k)}_{X^n}(X^n)
  = \lim_{n\to\infty} \frac{1}{n} \sum_{l=k}^{n-1} \ln f_{X_k|X^k}(X_k|X^k)\, T^{l-k}
  = E_{p_\psi} \ln f_{X_k|X^k}(X_k|X^k).
\]

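Combining this with (8.15) we have that 

\[
\limsup_{n\to\infty} \frac{1}{n} \ln \frac{f_{X^n|\psi}(X^n|\psi)}{f_{X^n}(X^n)}
  \le \bar{H}_{p_\psi\|m}(X) - E_{p_\psi} \ln f_{X_k|X^k}(X_k|X^k).
\]

From Lemma 7.4.1, the right hand side is just $I_{p_\psi}(X_k; X^- \mid X^k)$ which from 
Corollary 7.4.2 is just $\bar{H}_{p_\psi\|p^{(k)}}(X)$. Since the bound holds for all $k$, we have that 

\[
\limsup_{n\to\infty} \frac{1}{n} \ln \frac{f_{X^n|\psi}(X^n|\psi)}{f_{X^n}(X^n)}
  \le \inf_k \bar{H}_{p_\psi\|p^{(k)}}(X) \equiv \zeta.
\]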

Using the ergodic decomposition of relative entropy rate (Corollary 7.5.1) 
and the fact that Markov approximations are asymptotically accurate (Corollary 
7.4.3) we have further that 

\[
\begin{aligned}
\int dP_\psi\, \zeta &= \int dP_\psi \inf_k \bar{H}_{p_\psi\|p^{(k)}}(X) \\
 &\le \inf_k \int dP_\psi\, \bar{H}_{p_\psi\|p^{(k)}}(X) = \inf_k \bar{H}_{p\|p^{(k)}}(X) = 0
\end{aligned}
\]

and hence $\zeta = 0$ with $P_\psi$ probability 1. Thus 

\[
\limsup_{n\to\infty} \frac{1}{n} \ln \frac{f_{X^n|\psi}(X^n|\psi)}{f_{X^n}(X^n)} \le 0, \tag{8.17}
\]

which with (8.16) completes the sandwich proof. □ 

Simply restating the theorem and using (8.15) yields the ergodic theorem 
for relative entropy densities in the general stationary case. 

Corollary 8.3.1: Given the assumptions of Theorem 8.3.1, 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n) = \bar{H}_{p_\psi\|m}(X), \quad p\text{-a.e.}
\]

The corollary states that the sample relative entropy density of a process 
satisfying (8.9) converges to the conditional relative entropy rate with respect 
to the underlying ergodic component. This is a slight extension and elaboration 
of Barron's result [9] which made the stronger assumption that $H_{p\|m}(X_0|X^-) = 
\bar{H}_{p\|m}(X) < \infty$. From Corollary 7.4.3 this condition is sufficient but not 
necessary for the finite-gap information property of (8.9). In particular, the finite-gap 
information property implies that 

\[
\bar{H}_{p\|p^{(k)}}(X) = I_p(X_k; X^- \mid X^k) < \infty,
\]

but it need not be true that $\bar{H}_{p\|m}(X) < \infty$. In addition, Barron [9] and 
Algoet and Cover [7] do not characterize the limiting density as the entropy 
rate of the ergodic component; instead they effectively show that the limit 
is $E_{p_\psi}(\ln f_{X_0|X^-}(X_0|X^-))$. This, however, is equivalent since it follows from 
the ergodic decomposition (see specifically Lemma 8.6.2 [50]) that $f_{X_0|X^-} = 
f_{X_0|X^-,\psi}$ with probability one since the ergodic component $\psi$ can be 
determined from the infinite past $X^-$. 
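
The statement of Theorem 8.3.1 can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions and is not taken from the text: the stationary nonergodic source is a hypothetical equal-weight mixture of two i.i.d. Bernoulli components (`COMPONENTS`, `WEIGHTS`), and the reference measure $m$ is assumed to be i.i.d. uniform on $\{0,1\}$. The function `information_gap` evaluates $n^{-1} i(X^n;\psi) = n^{-1} \ln [f_{X^n|\psi}/f_{X^n}]$, which should be close to 0, while `sample_density` evaluates $n^{-1} \ln f_{X^n}(X^n)$, which should be close to the relative entropy rate of whichever component was selected.

```python
import math
import random

# Hypothetical two-component example (not from the text): psi selects a
# Bernoulli parameter, after which the source is i.i.d. with that parameter.
COMPONENTS = [0.2, 0.7]    # P(X_i = 1) under each ergodic component
WEIGHTS = [0.5, 0.5]       # distribution P_psi of the component index


def log_component_prob(lam, ones, n):
    """ln p_lambda(x^n) for an i.i.d. Bernoulli(lam) component."""
    return ones * math.log(lam) + (n - ones) * math.log(1.0 - lam)


def log_mixture_prob(ones, n):
    """ln p(x^n) for the mixture, using a log-sum-exp for numerical stability."""
    terms = [math.log(w) + log_component_prob(lam, ones, n)
             for w, lam in zip(WEIGHTS, COMPONENTS)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def simulate(n, rng):
    """Return (component index, number of ones) for one sample path of length n."""
    idx = rng.choices(range(len(COMPONENTS)), weights=WEIGHTS)[0]
    ones = sum(rng.random() < COMPONENTS[idx] for _ in range(n))
    return idx, ones


def information_gap(idx, ones, n):
    """(1/n) i(X^n; psi) = (1/n) ln [p_psi(x^n) / p(x^n)]."""
    return (log_component_prob(COMPONENTS[idx], ones, n)
            - log_mixture_prob(ones, n)) / n


def sample_density(ones, n):
    """(1/n) ln f_{X^n}(x^n) when m is i.i.d. uniform, so M(x^n) = 2^{-n}."""
    return (log_mixture_prob(ones, n) + n * math.log(2.0)) / n


def component_rate(lam):
    """Relative entropy rate of Bernoulli(lam) with respect to the uniform m."""
    return (lam * math.log(2.0 * lam)
            + (1.0 - lam) * math.log(2.0 * (1.0 - lam)))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 5000
    idx, ones = simulate(n, rng)
    print(information_gap(idx, ones, n))
    print(sample_density(ones, n), component_rate(COMPONENTS[idx]))
```

In this example the limit of the sample density is random: it depends on which component the path came from, which is exactly the dependence on $\psi$ in Corollary 8.3.1.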



8.4 AMS Sources 



The following lemma is a generalization of Lemma 3.4.1. The result is due to 
Barron [9], who proved it using martingale inequalities and convergence results. 

Lemma 8.4.1: Let $\{X_n\}$ be an AMS source with the property that for 
every integer $k$ there exists an integer $l = l(k)$ such that 

\[
I_p(X^k; (X_{k+l}, X_{k+l+1}, \ldots) \mid X_k^l) < \infty. \tag{8.18}
\]

Then 

\[
\lim_{n\to\infty} \frac{1}{n} i(X^k; (X_{k+l}, \ldots, X_{n-1}) \mid X_k^l) = 0; \quad p\text{-a.e.}
\]

Proof: By assumption 

\[
I_p(X^k; (X_{k+l}, X_{k+l+1}, \ldots) \mid X_k^l)
  = E_p \ln \frac{f_{X^k|X_k,X_{k+1},\ldots}(X^k|X_k,X_{k+1},\ldots)}{f_{X^k|X_k^l}(X^k|X_k^l)} < \infty.
\]

This implies that 

\[
P_{X^k \times (X_{k+l},\ldots) \mid X_k^l} \gg P_{X_0,X_1,\ldots}
\]

with 

\[
\frac{dP_{X_0,X_1,\ldots}}{dP_{X^k \times (X_{k+l},\ldots) \mid X_k^l}}
  = \frac{f_{X^k|X_k,X_{k+1},\ldots}(X^k|X_k,X_{k+1},\ldots)}{f_{X^k|X_k^l}(X^k|X_k^l)}.
\]

Restricting the measures to $X^n$ for $n > k + l$ yields 

\[
\frac{dP_{X^n}}{dP_{X^k \times (X_{k+l},\ldots,X_{n-1}) \mid X_k^l}}
  = \frac{f_{X^k|X_k,X_{k+1},\ldots,X_{n-1}}(X^k|X_k,X_{k+1},\ldots,X_{n-1})}{f_{X^k|X_k^l}(X^k|X_k^l)},
\]

the logarithm of which is the conditional information density 
$i(X^k; (X_{k+l}, \ldots, X_{n-1}) \mid X_k^l)$. 

With this setup the lemma follows immediately from Theorem 5.4.1. □ 

The following theorem generalizes Lemma 3.4.2 and will yield the general 
result. It was first proved by Barron [9] using martingale inequalities. 

Theorem 8.4.1: Suppose that $p$ and $m$ are distributions of a standard 
alphabet process $\{X_n\}$ such that $p$ is AMS and $m$ is $k$-step Markov. Let $\bar{p}$ be a 
stationary measure that asymptotically dominates $p$ (e.g., the stationary mean). 
Suppose that $P_{X^n}$, $\bar{P}_{X^n}$, and $M_{X^n}$ are the distributions induced by $p$, $\bar{p}$, and $m$ 
and that $M_{X^n}$ dominates both $P_{X^n}$ and $\bar{P}_{X^n}$ for all $n$ and that $f_{X^n}$ and $\bar{f}_{X^n}$ 
are the corresponding densities. If there is an invariant function $h$ such that 

\[
\lim_{n\to\infty} \frac{1}{n} \ln \bar{f}_{X^n}(X^n) = h; \quad \bar{p}\text{-a.e.}
\]

then also 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n) = h; \quad p\text{-a.e.}
\]

Proof: For any $k$ and $n \ge k$ we can write using the chain rule for densities 

\[
\frac{1}{n} \ln f_{X^n} - \frac{1}{n} \ln f_{X_k^{n-k}} = \frac{1}{n} \ln f_{X^k|X_k^{n-k}}.
\]

Since for $k \le l < n$ 

\[
\frac{1}{n} \ln f_{X^k|X_k^{n-k}} = \frac{1}{n} \ln f_{X^k|X_k^l}
  + \frac{1}{n} i(X^k; (X_{k+l}, \ldots, X_{n-1}) \mid X_k^l),
\]

Lemma 8.4.1 and the fact that densities are finite with probability one imply 
that 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{X^k|X_k^{n-k}} = 0; \quad p\text{-a.e.}
\]

This implies that there is a subsequence $k(n) \to \infty$ such that 

\[
\frac{1}{n} \ln f_{X^n}(X^n) - \frac{1}{n} \ln f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)}) \to 0; \quad p\text{-a.e.}
\]



To prove this, for each $k$ choose $N(k)$ large enough so that 

\[
p\left( \left| \frac{1}{N(k)} \ln f_{X^k|X_k^{N(k)-k}}(X^k|X_k^{N(k)-k}) \right| > 2^{-k} \right) \le 2^{-k}
\]

and then let $k(n) = k$ for $N(k) \le n < N(k+1)$. Then from the Borel-Cantelli 
lemma we have for any $\epsilon$ that 

\[
p\left( \left| \frac{1}{N(k)} \ln f_{X^k|X_k^{N(k)-k}}(X^k|X_k^{N(k)-k}) \right| > \epsilon \ \text{i.o.} \right) = 0
\]

and hence 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n)
  = \lim_{n\to\infty} \frac{1}{n} \ln f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)}); \quad p\text{-a.e.}
\]

In a similar manner we can also choose the sequence so that 

\[
\lim_{n\to\infty} \frac{1}{n} \ln \bar{f}_{X^n}(X^n)
  = \lim_{n\to\infty} \frac{1}{n} \ln \bar{f}_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)}); \quad \bar{p}\text{-a.e.}
\]

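From Markov's inequality 

\[
\begin{aligned}
\bar{p}\left( \frac{1}{n} \ln f_{X_k^{n-k}}(X_k^{n-k}) \ge \frac{1}{n} \ln \bar{f}_{X_k^{n-k}}(X_k^{n-k}) + \epsilon \right)
 &= \bar{p}\left( \frac{f_{X_k^{n-k}}(X_k^{n-k})}{\bar{f}_{X_k^{n-k}}(X_k^{n-k})} \ge e^{n\epsilon} \right) \\
 &\le e^{-n\epsilon} \int d\bar{p}\, \frac{f_{X_k^{n-k}}(X_k^{n-k})}{\bar{f}_{X_k^{n-k}}(X_k^{n-k})} \\
 &= e^{-n\epsilon} \int dm\, f_{X_k^{n-k}}(X_k^{n-k}) = e^{-n\epsilon}.
\end{aligned}
\]

Hence again invoking the Borel-Cantelli lemma we have that 

\[
\bar{p}\left( \frac{1}{n} \ln f_{X_k^{n-k}}(X_k^{n-k}) \ge \frac{1}{n} \ln \bar{f}_{X_k^{n-k}}(X_k^{n-k}) + \epsilon \ \text{i.o.} \right) = 0
\]

and therefore 

\[
\limsup_{n\to\infty} \frac{1}{n} \ln f_{X_k^{n-k}}(X_k^{n-k}) \le h, \quad \bar{p}\text{-a.e.} \tag{8.19}
\]

The above event is in the tail $\sigma$-field $\bigcap_n \sigma(X_n, X_{n+1}, \ldots)$ since $h$ is invariant 
and $\bar{p}$ dominates $p$ on the tail $\sigma$-field. Thus 

\[
\limsup_{n\to\infty} \frac{1}{n} \ln f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)}) \le h; \quad p\text{-a.e.}
\]

and hence 

\[
\limsup_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n) \le h; \quad p\text{-a.e.},
\]

which proves half of the theorem. 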

Since $\bar{p}$ asymptotically dominates $p$, given $\epsilon > 0$ there is a $k$ such that 

\[
p\left( \lim_{n\to\infty} \frac{1}{n} \ln \bar{f}_{X_k^{n-k}}(X_k^{n-k}) = h \right) \ge 1 - \epsilon.
\]

Again applying Markov's inequality and the Borel-Cantelli lemma as previously 
we have that 

\[
\liminf_{n\to\infty} \frac{1}{n} \ln
  \frac{f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)})}{\bar{f}_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)})} \ge 0; \quad p\text{-a.e.},
\]

which implies that 

\[
p\left( \liminf_{n\to\infty} \frac{1}{n} \ln f_{X_k^{n-k}}(X_k^{n-k}) \ge h \right) \ge 1 - \epsilon
\]

and hence also that 

\[
p\left( \liminf_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n) \ge h \right) \ge 1 - \epsilon.
\]

Since $\epsilon$ can be made arbitrarily small, this proves that $p$-a.e. $\liminf_{n\to\infty} n^{-1} h_n \ge h$, 
which completes the proof of the theorem. □ 

We can now extend the ergodic theorem for relative entropy densities to the 
general AMS case. 

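Corollary 8.4.1: Given the assumptions of Theorem 8.4.1, 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n) = \bar{H}_{\bar{p}_\psi\|m}(X); \quad p\text{-a.e.},
\]

where $\bar{p}_\psi$ is the ergodic component of the stationary mean $\bar{p}$ of $p$. 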

Proof: The proof follows immediately from Theorem 8.4.1 and Corollary 
8.3.1, the ergodic theorem for the relative entropy density for the stationary 
mean. □ 



8.5 Ergodic Theorems for Information Densities 

As an application of the general theorem we prove an ergodic theorem for mutual 
information densities for stationary and ergodic sources. The result can be 
extended to AMS sources in the same manner that the results of Section 8.3 
were extended to those of Section 8.4. As the stationary and ergodic result 
suffices for the coding theorems and the AMS conditions are messy, only the 
stationary case is considered here. The result is due to Barron [9]. 

Theorem 8.5.1: Let $\{X_n, Y_n\}$ be a stationary ergodic pair random process 
with standard alphabet. Let $P_{X^n Y^n}$, $P_{X^n}$, and $P_{Y^n}$ denote the induced 
distributions and assume that for all $n$ $P_{X^n} \times P_{Y^n} \gg P_{X^n Y^n}$ and hence the 
information densities 

\[
i_n(X^n; Y^n) = \ln \frac{dP_{X^n Y^n}}{d(P_{X^n} \times P_{Y^n})}
\]

are well defined. Assume in addition that both the $\{X_n\}$ and $\{Y_n\}$ processes 
have the finite-gap information property of (8.9) and hence by the comment 
following Corollary 7.3.1 there is a $K$ such that both processes satisfy the 
$K$-gap property 

\[
I(X_K; X^- \mid X^K) < \infty, \qquad I(Y_K; Y^- \mid Y^K) < \infty.
\]

Then 

\[
\lim_{n\to\infty} \frac{1}{n} i_n(X^n; Y^n) = \bar{I}(X;Y); \quad p\text{-a.e.}
\]

Proof: Let $Z_n = (X_n, Y_n)$. Let $M_{X^n} = P^{(K)}_{X^n}$ and $M_{Y^n} = P^{(K)}_{Y^n}$ denote 
the $K$th order Markov approximations of $\{X_n\}$ and $\{Y_n\}$, respectively. The 
finite-gap approximation implies as in Section 8.3 that the densities 

\[
f_{X^n} = \frac{dP_{X^n}}{dM_{X^n}}
\]

and 

\[
f_{Y^n} = \frac{dP_{Y^n}}{dM_{Y^n}}
\]

are well defined. From Theorem 8.2.1 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{X^n}(X^n) = H_{p_X\|p_X^{(K)}}(X_0|X^-) = I(X_K; X^- \mid X^K) < \infty,
\]

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{Y^n}(Y^n) = I(Y_K; Y^- \mid Y^K) < \infty.
\]

Define the measures $M_{Z^n}$ by $M_{X^n} \times M_{Y^n}$. Then this is a $K$-step Markov 
source and since 

\[
M_{X^n} \times M_{Y^n} \gg P_{X^n} \times P_{Y^n} \gg P_{X^n,Y^n} = P_{Z^n},
\]

the density 

\[
f_{Z^n} = \frac{dP_{Z^n}}{dM_{Z^n}}
\]

is well defined and from Theorem 8.2.1 has a limit 

\[
\lim_{n\to\infty} \frac{1}{n} \ln f_{Z^n}(Z^n) = H_{p\|m}(Z_0|Z^-).
\]



If the density $i_n(X^n, Y^n)$ is infinite for any $n$, then it is infinite for all larger 
$n$ and convergence is trivially to the infinite information rate. If it is finite, the 
chain rule for densities yields 

\[
\begin{aligned}
\frac{1}{n} i_n(X^n; Y^n)
 &= \frac{1}{n} \ln f_{Z^n}(Z^n) - \frac{1}{n} \ln f_{X^n}(X^n) - \frac{1}{n} \ln f_{Y^n}(Y^n) \\
 &\to H_{p\|p^{(K)}}(Z_0|Z^-) - H_{p\|p^{(K)}}(X_0|X^-) - H_{p\|p^{(K)}}(Y_0|Y^-) \\
 &= \bar{H}_{p\|p^{(K)}}(X,Y) - \bar{H}_{p\|p^{(K)}}(X) - \bar{H}_{p\|p^{(K)}}(Y).
\end{aligned}
\]

The limit is not indeterminate (of the form $\infty - \infty$) because the two subtracted 
terms are finite. Since convergence is to a constant, the constant must also be 
the limit of the expected values of $n^{-1} i_n(X^n, Y^n)$, that is, $\bar{I}(X;Y)$. □ 
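
The simplest instance of Theorem 8.5.1 is an i.i.d. pair process, for which the information density is a sum of per-letter terms and the information rate equals the single-letter mutual information. The following Python sketch illustrates this special case only; the parameters `Q` and `EPS` (a Bernoulli input observed through a binary symmetric perturbation) are hypothetical choices made for the illustration and do not come from the text.

```python
import math
import random

Q = 0.3      # hypothetical P(X_i = 1)
EPS = 0.1    # hypothetical probability that Y_i differs from X_i


def output_prob_one():
    """P(Y_i = 1) for the pair process."""
    return Q * (1.0 - EPS) + (1.0 - Q) * EPS


def letter_information(x, y):
    """ln [P(y | x) / P(y)], the information density of one pair of symbols."""
    p_y_given_x = 1.0 - EPS if x == y else EPS
    p1 = output_prob_one()
    p_y = p1 if y == 1 else 1.0 - p1
    return math.log(p_y_given_x / p_y)


def sample_information_density(n, rng):
    """(1/n) i_n(X^n; Y^n) along one simulated sample path."""
    total = 0.0
    for _ in range(n):
        x = 1 if rng.random() < Q else 0
        flip = 1 if rng.random() < EPS else 0
        total += letter_information(x, x ^ flip)
    return total / n


def binary_entropy(t):
    """Entropy in nats of a Bernoulli(t) random variable, 0 < t < 1."""
    return -t * math.log(t) - (1.0 - t) * math.log(1.0 - t)


def information_rate():
    """I(X;Y) = H(Y) - H(Y|X) in nats, the information rate of the i.i.d. pair."""
    return binary_entropy(output_prob_one()) - binary_entropy(EPS)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_information_density(20000, rng), information_rate())
```

For processes with memory the per-letter decomposition used in `sample_information_density` is no longer available, which is why the general proof works with densities relative to Markov approximations instead.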
Chapter 9 



Channels and Codes 



9.1 Introduction 

We have considered a random process or source {X n } as a sequence of random 
entities, where the object produced at each time could be quite general, e.g., 
a random variable, vector, or waveform. Hence sequences of pairs of random 
objects such as {X n ,Y n } are included in the general framework. We now focus 
on the possible interrelations between the two components of such a pair process. 
In particular, we consider the situation where we begin with one source, say 
{X n }, called the input and use either a random or a deterministic mapping to 
form a new source {Y n }, called the output. We generally refer to the mapping 
as a channel if it is random and a code if it is deterministic. Hence a code is 
a special case of a channel and results for channels will immediately imply the 
corresponding results for codes. The initial point of interest will be conditions 
on the structure of the channel under which the resulting pair process {X n , Y n } 
will inherit stationarity and ergodic properties from the original source {X n }. 
We will also be interested in the behavior resulting when the output of one 
channel serves as the input to another, that is, when we form a new channel 
as a cascade of other channels. Such cascades yield models of a communication 
system which typically has a code mapping (called the encoder ) followed by a 
channel followed by another code mapping (called the decoder). 

A fundamental nuisance in the development is the notion of time. So far we 
have considered pair processes where at each unit of time, one random object is 
produced for each coordinate of the pair. In the channel or code example, this 
corresponds to one output for every input. Interesting communication systems 
do not always easily fit into this framework, and this can cause serious problems 
in notation and in the interpretation and development of results. For example, 
suppose that an input source consists of a sequence of real numbers and let 
T denote the time shift on the real sequence space. Suppose that the output 
source consists of a binary sequence and let S denote its shift. Suppose also 
that the channel is such that for each real number in, three binary symbols are 
produced. This fits our usual framework if we consider each output variable to 
consist of a binary three-tuple since then there is one output vector for each 
input symbol. One must be careful, however, when considering the stationarity 
of such a system. Do we consider the output process to be physically stationary 
if it is stationary with respect to $S$ or with respect to $S^3$? The former might 
make more sense if we are looking at the output alone, the latter if we are looking 
at the output in relation to the input. How do we define stationarity for the pair 
process? Given two sequence spaces, we might first construct a shift on the pair 
sequence space as simply the cartesian product of the shifts, e.g., given an input 
sequence $x$ and an output sequence $y$ define a shift $T^*$ by $T^*(x,y) = (Tx, Sy)$. 
While this might seem natural given simply the pair random process $\{X_n, Y_n\}$, 
it is not natural in the physical context that one symbol of $X$ yields three 
symbols of $Y$. In other words, the two shifts do not correspond to the same 
amount of time. Here the more physically meaningful shift on the pair space 
would be $T'(x,y) = (Tx, S^3 y)$ and the more physically meaningful questions on 
stationarity and ergodicity relate to $T'$ and not to $T^*$. The problem becomes 
even more complicated when channels or codes produce a varying number of 
output symbols for each input symbol, where the number of symbols depends 
on the input sequence. Such variable rate codes arise often in practice, especially 
for noiseless coding applications such as Huffman, Lempel-Ziv, and arithmetic 
codes. (See [140] for a survey of noiseless coding.) While we will not treat such 
variable rate systems in any detail, they point out the difficulty that can arise 
associating the mathematical shift operation with physical time when we are 
considering cartesian products of spaces, each having their own shift. 
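
The mismatch between the two shifts can be seen in a few lines of Python. The sketch below is purely illustrative: the three-bit quantizer `encode_symbol` is a hypothetical stand-in for a code producing three binary symbols per real input, and finite lists stand in for prefixes of one-sided sequences. It checks that shifting the input once corresponds to shifting the output three times, as with $T'(x,y) = (Tx, S^3 y)$, and not once.

```python
def encode_symbol(v):
    """Map a real number in [0, 1) to a binary three-tuple (3-bit quantizer)."""
    level = min(int(v * 8), 7)
    return [(level >> 2) & 1, (level >> 1) & 1, level & 1]


def encode(x):
    """Apply the code to a finite prefix of a one-sided input sequence."""
    out = []
    for v in x:
        out.extend(encode_symbol(v))
    return out


def shift(seq, times=1):
    """One-sided shift applied `times` times: drop the first `times` symbols."""
    return seq[times:]


x = [0.05, 0.40, 0.77, 0.99, 0.31]
y = encode(x)

# Shifting the input once corresponds to shifting the output three times.
matches_block_shift = encode(shift(x)) == shift(y, 3)
# A single output shift does not correspond to one input shift.
matches_single_shift = encode(shift(x)) == shift(y, 1)

if __name__ == "__main__":
    print(matches_block_shift, matches_single_shift)
```

A variable-rate code would break even this correspondence, since the number of output symbols to drop would then depend on the input symbol being shifted out.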

There is no easy way to solve this problem notationally. We adopt the 
following view as a compromise which is usually adequate for fixed-rate systems. 
We will be most interested in pair processes that are stationary in the physical 
sense, that is, whose statistics are not changed when both are shifted by an 
equal amount of physical time. This is the same as stationarity with respect 
to the product shift if the two shifts correspond to equal amounts of physical 
time. Hence for simplicity we will usually focus on this case. More general cases 
will be introduced when appropriate to point out their form and how they can 
be put into the matching shift structure by considering groups of symbols and 
different shifts. This will necessitate occasional discussions about what is meant 
by stationarity or ergodicity for a particular system. 

The mathematical generalization of Shannon's original notions of sources, 
codes, and channels is due to Khinchine [72] [73]. Khinchine's results 
characterizing stationarity and ergodicity of channels were corrected and developed 
by Adler [2]. 



9.2 Channels 

Say we are given a source $[A, X, \mu]$, that is, a sequence of $A$-valued random 
variables $\{X_n;\ n \in \mathcal{T}\}$ defined on a common probability space $(\Omega, \mathcal{F}, P)$ having 
a process distribution $\mu$ defined on the measurable sequence space $(A^{\mathcal{T}}, \mathcal{B}_A^{\mathcal{T}})$. 
We shall let $X = \{X_n;\ n \in \mathcal{T}\}$ denote the sequence-valued random variable, 
that is, the random variable taking values in $A^{\mathcal{T}}$ according to the distribution 
$\mu$. Let $B$ be another alphabet with a corresponding measurable sequence space 
$(B^{\mathcal{T}}, \mathcal{B}_B^{\mathcal{T}})$. We assume as usual that $A$ and $B$ are standard and hence so 
are their sequence spaces and cartesian products. A channel $[A, \nu, B]$ with 
input alphabet $A$ and output alphabet $B$ (we denote the channel simply by $\nu$ 
when these alphabets are clear from context) is a family of probability measures 
$\{\nu_x;\ x \in A^{\mathcal{T}}\}$ on $(B^{\mathcal{T}}, \mathcal{B}_B^{\mathcal{T}})$ (the output sequence space) such that for every 
output event $F \in \mathcal{B}_B^{\mathcal{T}}$, $\nu_x(F)$ is a measurable function of $x$. This measurability 
requirement ensures that the set function $p$ specified on the joint input/output 
space $(A^{\mathcal{T}} \times B^{\mathcal{T}}, \mathcal{B}_A^{\mathcal{T}} \times \mathcal{B}_B^{\mathcal{T}})$ by its values on rectangles as 

\[
p(G \times F) = \int_G d\mu(x)\, \nu_x(F); \quad F \in \mathcal{B}_B^{\mathcal{T}},\ G \in \mathcal{B}_A^{\mathcal{T}},
\]

is well defined. The set function $p$ is nonnegative, normalized, and countably 
additive on the field generated by the rectangles $G \times F$, $G \in \mathcal{B}_A^{\mathcal{T}}$, $F \in \mathcal{B}_B^{\mathcal{T}}$. 
Thus $p$ extends to a probability measure on the joint input/output space, which 
is sometimes called the hookup of the source $\mu$ and channel $\nu$. We will often 
denote this joint measure by $\mu\nu$. The corresponding sequences of random variables 
are called the input/output process. 

Thus a channel is a probability measure on the output sequence space for 
each input sequence such that a joint input/output probability measure is 
well defined. The above equation shows that a channel is simply a regular conditional 
probability, in particular, 

\[
\nu_x(F) = p((x,y) : y \in F \mid x); \quad F \in \mathcal{B}_B^{\mathcal{T}},\ x \in A^{\mathcal{T}}.
\]

We can relate a channel to the notation used previously for conditional 
distributions by using the sequence-valued random variables $X = \{X_n;\ n \in \mathcal{T}\}$ 
and $Y = \{Y_n;\ n \in \mathcal{T}\}$: 

\[
\nu_x(F) = P_{Y|X}(F|x). \tag{9.1}
\]

Eq. (1.26) then provides the probability of an arbitrary input/output event: 

\[
p(F) = \int d\mu(x)\, \nu_x(F_x),
\]

where $F_x = \{y : (x,y) \in F\}$ is the section of $F$ at $x$. 

If we start with a hookup $p$, then we can obtain the input distribution $\mu$ as 

\[
\mu(F) = p(F \times B^{\mathcal{T}}); \quad F \in \mathcal{B}_A^{\mathcal{T}}.
\]

Similarly we can obtain the output distribution, say $\eta$, via 

\[
\eta(F) = p(A^{\mathcal{T}} \times F); \quad F \in \mathcal{B}_B^{\mathcal{T}}.
\]
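
When the input and output spaces are finite the integrals above become finite sums, and the hookup and its marginals can be computed directly. The following Python sketch is a toy illustration under that assumption: the dictionaries `MU` and `NU` are hypothetical, and single letters stand in for entire sequences so that the sums stay small. It implements the rectangle formula, the section formula for general events, and the input and output marginals.

```python
# Hypothetical finite example: single letters stand in for whole sequences.
MU = {0: 0.25, 1: 0.75}                      # source distribution mu
NU = {0: {"a": 0.5, "b": 0.5},               # channel: one measure nu_x per input x
      1: {"a": 0.2, "b": 0.8}}
INPUTS = set(MU)
OUTPUTS = {"a", "b"}


def channel(x, F):
    """nu_x(F) for an output event F (a set of output letters)."""
    return sum(NU[x][y] for y in F)


def hookup_rectangle(G, F):
    """p(G x F) = sum over x in G of mu(x) nu_x(F)."""
    return sum(MU[x] * channel(x, F) for x in G)


def hookup(event):
    """p(F) for an arbitrary set of (x, y) pairs, using the sections F_x."""
    total = 0.0
    for x in INPUTS:
        section = {y for (u, y) in event if u == x}
        total += MU[x] * channel(x, section)
    return total


def input_marginal(G):
    """mu(G) recovered from the hookup as p(G x B)."""
    return hookup_rectangle(G, OUTPUTS)


def output_marginal(F):
    """eta(F) = p(A x F), the output distribution."""
    return hookup_rectangle(INPUTS, F)


if __name__ == "__main__":
    print(input_marginal({1}), output_marginal({"a"}))
```

Here the measurability requirement on $\nu_x(F)$ is automatic because the input space is finite; it only becomes a genuine condition for general standard alphabets.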

Suppose one now starts with a pair process distribution $p$ and hence also 
with the induced source distribution $\mu$. Does there exist a channel $\nu$ for which 
$p = \mu\nu$? The answer is yes since the spaces are standard. One can always define 
the conditional probability $\nu_x(F) = P(A^{\mathcal{T}} \times F \mid X = x)$ for all input sequences $x$, 
but this need not possess a regular version, that is, be a probability measure for 
all $x$, in the case of arbitrary alphabets. If the alphabets are standard, however, 
we have seen that a regular conditional probability measure always exists. 



9.3 Stationarity Properties of Channels 

We now define a variety of stationarity properties for channels that are related 
to, but not the same as, those for sources. The motivation behind the var- 
ious definitions is that stationarity properties of channels coupled with those 
of sources should imply stationarity properties for the resulting source-channel 
hookups. 

The classical definition of a stationary channel is the following: Suppose that 
we have a channel $[A, \nu, B]$ and suppose that $T_A$ and $T_B$ are the shifts on the 
input sequence space and output sequence space, respectively. The channel is 
stationary with respect to $T_A$ and $T_B$ or $(T_A, T_B)$-stationary if 

\[
\nu_x(T_B^{-1} F) = \nu_{T_A x}(F), \quad x \in A^{\mathcal{T}},\ F \in \mathcal{B}_B^{\mathcal{T}}. \tag{9.2}
\]

If the transformations are clear from context then we simply say that the channel 
is stationary. Intuitively, a right shift of an output event yields the same 
probability as the left shift of an input event. The different shifts are required 
because in general only $T_A x$ and not $T_A^{-1} x$ exists since the shift may not be 
invertible and in general only $T_B^{-1} F$ and not $T_B F$ exists for the same reason. If 
the shifts are invertible, e.g., the processes are two-sided, then the definition is 
equivalent to 

\[
\nu_{T_A x}(T_B F) = \nu_{T_A^{-1} x}(T_B^{-1} F) = \nu_x(F), \quad \text{all } x \in A^{\mathcal{T}},\ F \in \mathcal{B}_B^{\mathcal{T}}, \tag{9.3}
\]

that is, shifting the input sequence and output event in the same direction does 
not change the probability. 

The fundamental importance of the stationarity of a channel is contained in 
the following lemma. 

Lemma 9.3.1: If a source $[A, \mu]$, stationary with respect to $T_A$, is connected 
to channel $[A, \nu, B]$, stationary with respect to $T_A$ and $T_B$, then the resulting 
hookup $\mu\nu$ is also stationary (with respect to the cartesian product shift $T = 
T_{A \times B} = T_A \times T_B$ defined by $T(x,y) = (T_A x, T_B y)$). 

Proof: We have that 

\[
\mu\nu(T^{-1} F) = \int d\mu(x)\, \nu_x((T^{-1} F)_x).
\]

Now 

\[
\begin{aligned}
(T^{-1} F)_x &= \{y : T(x,y) \in F\} = \{y : (T_A x, T_B y) \in F\} \\
 &= \{y : T_B y \in F_{T_A x}\} = T_B^{-1} F_{T_A x}
\end{aligned} \tag{9.4}
\]

and hence 

\[
\mu\nu(T^{-1} F) = \int d\mu(x)\, \nu_x(T_B^{-1} F_{T_A x}).
\]

Since the channel is stationary, however, this becomes 

\[
\mu\nu(T^{-1} F) = \int d\mu(x)\, \nu_{T_A x}(F_{T_A x}) = \int d\mu T_A^{-1}(x)\, \nu_x(F_x),
\]

where we have used the change of variables formula. Since $\mu$ is stationary, 
however, the right hand side is 

\[
\int d\mu(x)\, \nu_x(F_x) = \mu\nu(F),
\]

which proves the lemma. □ 
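
Lemma 9.3.1 can be checked exactly in a small finite example. The sketch below makes several assumptions that are not in the text: the source is a hypothetical two-state Markov chain with matrix `P`, the channel is memoryless with a fixed transition matrix `W` (so that it is stationary in the sense of (9.2)), and stationarity of the hookup is tested only on windows of two consecutive input/output pairs. It compares the joint law of the window starting at time 0 with that of the window starting at time 1, first with the source started in its stationary distribution and then with a nonstationary initial distribution.

```python
from itertools import product

P = [[0.9, 0.1], [0.3, 0.7]]      # hypothetical source transition matrix
STATIONARY = [0.75, 0.25]         # satisfies STATIONARY * P = STATIONARY
W = [[0.8, 0.2], [0.1, 0.9]]      # hypothetical memoryless channel, W[x][y]


def source_prob(init, xs):
    """Probability of the input block xs when X_0 has distribution init."""
    prob = init[xs[0]]
    for a, b in zip(xs, xs[1:]):
        prob *= P[a][b]
    return prob


def window_law(init, start, width=2, horizon=3):
    """Exact joint law of (X_t, Y_t) for t = start, ..., start + width - 1."""
    law = {}
    for xs in product(range(2), repeat=horizon):
        px = source_prob(init, xs)
        for ys in product(range(2), repeat=horizon):
            pxy = px
            for x, y in zip(xs, ys):
                pxy *= W[x][y]
            key = tuple(zip(xs[start:start + width], ys[start:start + width]))
            law[key] = law.get(key, 0.0) + pxy
    return law


def max_difference(law_a, law_b):
    """Largest absolute difference between two laws on the same finite space."""
    keys = set(law_a) | set(law_b)
    return max(abs(law_a.get(k, 0.0) - law_b.get(k, 0.0)) for k in keys)


# Stationary source: the shifted window has the same law (up to rounding).
stationary_gap = max_difference(window_law(STATIONARY, 0),
                                window_law(STATIONARY, 1))
# Nonstationary source (X_0 = 0 with probability one): the laws differ.
transient_gap = max_difference(window_law([1.0, 0.0], 0),
                               window_law([1.0, 0.0], 1))

if __name__ == "__main__":
    print(stationary_gap, transient_gap)
```

The second comparison shows that stationarity of the channel alone is not enough: the hookup inherits stationarity only when the source is stationary as well.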

Suppose next that we are told that a hookup $\mu\nu$ is stationary. Does it then 
follow that the source $\mu$ and channel $\nu$ are necessarily stationary? The source 
must be since 

\[
\mu(T_A^{-1} F) = \mu\nu((T_A \times T_B)^{-1}(F \times B^{\mathcal{T}})) = \mu\nu(F \times B^{\mathcal{T}}) = \mu(F).
\]

The channel need not be stationary, however, since, for example, the stationarity 
could be violated on a set of $\mu$ measure 0 without affecting the proof of the 
above lemma. This suggests a somewhat weaker notion of stationarity which is 
more directly related to the stationarity of the hookup. We say that a channel 
$[A, \nu, B]$ is stationary with respect to a source $[A, \mu]$ if $\mu\nu$ is stationary. We also 
state that a channel is stationary $\mu$-a.e. if it satisfies (9.2) for all $x$ in a set of 
$\mu$-probability one. If a channel is stationary $\mu$-a.e. and $\mu$ is stationary, then 
the channel is also stationary with respect to $\mu$. Clearly a stationary channel 
is stationary with respect to all stationary sources. The reason for this more 
general view is that we wish to extend the definition of stationary channels to 
asymptotically mean stationary channels. The general definition extends; the 
classical definition of stationary channels does not. 

Observe that the various definitions of stationarity of channels immediately
extend to block shifts since they hold for any shifts defined on the input and
output sequence spaces, e.g., a channel stationary with respect to $T_A^N$ and $T_B^K$
could be a reasonable model for a channel or code that puts out $K$ symbols
from an alphabet $B$ every time it takes in $N$ symbols from an alphabet $A$. We
shorten the name $(T_A^N, T_B^K)$-stationary to $(N,K)$-stationary channel in this case.
A stationary channel (without modifiers) is simply a $(1,1)$-stationary channel in
this sense.

The most general notion of stationarity that we are interested in is that of
asymptotic mean stationarity. We define a channel $[A,\nu,B]$ to be asymptotically
mean stationary or AMS for a source $[A,\mu]$ with respect to $T_A$ and $T_B$ if the
hookup $\mu\nu$ is AMS with respect to the product shift $T_A \times T_B$. As in the
stationary case, an immediate necessary condition is that the input source be AMS
with respect to $T_A$. A channel will be said to be $(T_A,T_B)$-AMS if the hookup
is $(T_A,T_B)$-AMS for all $T_A$-AMS sources.
The following lemma shows that an AMS channel is indeed a generalization
of the idea of a stationary channel and that the stationary mean of a hookup of
an AMS source to a stationary channel is simply the hookup of the stationary
mean of the source to the channel.

Lemma 9.3.2: Suppose that $\nu$ is $(T_A,T_B)$-stationary and that $\mu$ is AMS
with respect to $T_A$. Let $\bar\mu$ denote the stationary mean of $\mu$ and observe that $\bar\mu\nu$
is stationary. Then the hookup $\mu\nu$ is AMS with stationary mean

$$\overline{\mu\nu} = \bar\mu\nu.$$

Thus, in particular, $\nu$ is an AMS channel.

Proof: We have that

$$(T^{-i}F)_x = \{y : (x,y) \in T^{-i}F\} = \{y : T^i(x,y) \in F\} = \{y : (T_A^i x, T_B^i y) \in F\} = \{y : T_B^i y \in F_{T_A^i x}\} = T_B^{-i}F_{T_A^i x} \tag{9.5}$$

and therefore since $\nu$ is stationary

$$\mu\nu(T^{-i}F) = \int d\mu(x)\, \nu_x(T_B^{-i}F_{T_A^i x}) = \int d\mu(x)\, \nu_{T_A^i x}(F_{T_A^i x}) = \int d\mu T_A^{-i}(x)\, \nu_x(F_x).$$

Therefore

$$\frac{1}{n}\sum_{i=0}^{n-1} \mu\nu(T^{-i}F) = \frac{1}{n}\sum_{i=0}^{n-1} \int d\mu T_A^{-i}(x)\, \nu_x(F_x) \mathop{\longrightarrow}_{n \to \infty} \int d\bar\mu(x)\, \nu_x(F_x) = \bar\mu\nu(F)$$

from Lemma 6.5.1 of [50]. This proves that $\mu\nu$ is AMS and that the stationary
mean is $\bar\mu\nu$. □



A final property crucial to quantifying the behavior of random processes is
that of ergodicity. Hence we define a (stationary, AMS) channel $\nu$ to be ergodic
with respect to $(T_A,T_B)$ if it has the property that whenever a (stationary, AMS)
ergodic source (with respect to $T_A$) is connected to the channel, the overall
input/output process is (stationary, AMS) ergodic. The following modification
of Lemma 6.7.4 of [50] is the principal tool for proving a channel to be ergodic.

Lemma 9.3.3: An AMS (stationary) channel $[A,\nu,B]$ is ergodic if for all
AMS (stationary) sources $\mu$ and all sets of the form $F = F_A \times F_B$, $G = G_A \times G_B$
for rectangles $F_A, G_A \in \mathcal{B}_A^T$ and $F_B, G_B \in \mathcal{B}_B^T$ we have that for $p = \mu\nu$

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=0}^{n-1} p(T_{A \times B}^{-i}F \cap G) = \bar p(F)\,p(G), \tag{9.6}$$

where $\bar p$ is the stationary mean of $p$ ($p$ if $p$ is already stationary).
Proof: The proof parallels that of Lemma 6.7.4 of [50]. The result does
not follow immediately from that lemma since the collection of given sets does
not itself form a field. Arbitrary events $F, G \in \mathcal{B}_{A \times B}^T$ can be approximated
arbitrarily closely by events in the field generated by the above rectangles and
hence given $\epsilon > 0$ we can find finite disjoint rectangles of the given form $F_i$,
$G_i$, $i = 1, \cdots, L$ such that if $F_0 = \bigcup_{i=1}^{L} F_i$ and $G_0 = \bigcup_{i=1}^{L} G_i$, then $p(F \Delta F_0)$,
$p(G \Delta G_0)$, $\bar p(F \Delta F_0)$, and $\bar p(G \Delta G_0)$ are all less than $\epsilon$. Then

$$\begin{aligned}
\Big|\frac{1}{n}\sum_{k=0}^{n-1} p(T^{-k}F \cap G) - \bar p(F)p(G)\Big|
&\le \Big|\frac{1}{n}\sum_{k=0}^{n-1} p(T^{-k}F \cap G) - \frac{1}{n}\sum_{k=0}^{n-1} p(T^{-k}F_0 \cap G_0)\Big| \\
&\quad + \Big|\frac{1}{n}\sum_{k=0}^{n-1} p(T^{-k}F_0 \cap G_0) - \bar p(F_0)p(G_0)\Big| + \big|\bar p(F_0)p(G_0) - \bar p(F)p(G)\big|.
\end{aligned}$$

Exactly as in Lemma 6.7.4 of [50], the rightmost term is bound above by $2\epsilon$
and the first term on the left goes to zero as $n \to \infty$. The middle term is the
absolute magnitude of

$$\frac{1}{n}\sum_{k=0}^{n-1} p\Big(T^{-k}\bigcup_i F_i \cap \bigcup_j G_j\Big) - \bar p\Big(\bigcup_i F_i\Big)p\Big(\bigcup_j G_j\Big)
= \sum_{i,j}\Big(\frac{1}{n}\sum_{k=0}^{n-1} p(T^{-k}F_i \cap G_j) - \bar p(F_i)p(G_j)\Big).$$

Each term in the finite sum converges to 0 by assumption. Thus $p$ is ergodic
from Lemma 6.7.4 of [50]. □

Because of the specific class of sets chosen, the above lemma considered
separate sets for shifting and remaining fixed, unlike using the same set for
both purposes as in Lemma 6.7.4 of [50]. This was required so that the cross
products in the final sum considered would converge accordingly.




9.4 Examples of Channels 

In this section a variety of examples of channels are introduced, ranging from the 
trivially simple to the very complicated. The first two channels are the simplest, 
the first being perfect and the second being useless (at least for communication
purposes).
Example 9.4.1: Noiseless Channel

A channel $[A,\nu,B]$ is said to be noiseless if $A = B$ and

$$\nu_x(F) = \begin{cases} 1 & x \in F \\ 0 & x \notin F, \end{cases}$$

that is, with probability one the channel puts out what goes in. Such a channel
is clearly stationary and ergodic.



Example 9.4.2: Completely Random Channel 

Suppose that $\eta$ is a probability measure on the output space $(B^T, \mathcal{B}_B^T)$ and
define a channel

$$\nu_x(F) = \eta(F), \quad F \in \mathcal{B}_B^T,\ x \in A^T.$$

Then it is easy to see that the input/output measure satisfies

$$p(G \times F) = \eta(F)\mu(G); \quad F \in \mathcal{B}_B^T,\ G \in \mathcal{B}_A^T,$$

and hence the input/output measure is a product measure and the input and
output sequences are therefore independent of each other. This channel is called
a completely random channel or product channel because the output is independent
of the input. This channel is quite useless because the output tells us
nothing of the input. The completely random channel is stationary (AMS) if
the measure $\eta$ is stationary (AMS). Perhaps surprisingly, such a channel need
not be ergodic even if $\eta$ is ergodic since the product of two stationary and
ergodic sources need not be ergodic. (See, e.g., [21].) We shall later see that if $\eta$
is also assumed to be weakly mixing, then the resulting channel is ergodic.

A generalization of the noiseless channel that is of much greater interest is 
the deterministic channel. Here the channel is not random, but the output is 
formed by a general mapping of the input rather than being the input itself. 



Example 9.4.3: Deterministic Channel and Sequence Coders 



A channel $[A,\nu,B]$ is said to be deterministic or nonrandom if each input string
is mapped into a fixed output string, that is, if there is a mapping $f : A^T \to B^T$
such that

$$\nu_x(G) = \begin{cases} 1 & f(x) \in G \\ 0 & f(x) \notin G. \end{cases}$$

The mapping $f$ must be measurable in order to satisfy the measurability
assumption of the channel. Note that such a channel can also be written as

$$\nu_x(G) = 1_{f^{-1}(G)}(x).$$

Define a sequence coder as a deterministic channel, that is, a measurable
mapping from one sequence space into another. It is easy to see that for a
deterministic code we have a hookup specified by

$$p(F \times G) = \mu(F \cap f^{-1}(G))$$

and an output process with distribution

$$\eta(G) = \mu(f^{-1}(G)).$$

A sequence coder is said to be $(T_A,T_B)$-stationary (or just stationary) or
$(T_A^N, T_B^K)$-stationary (or just $(N,K)$-stationary) if the corresponding channel
is. Thus a sequence coder $f$ is stationary if and only if $f(T_A x) = T_B f(x)$ and
it is $(N,K)$-stationary if and only if $f(T_A^N x) = T_B^K f(x)$.

Lemma 9.4.1: A stationary deterministic channel is ergodic. 

Proof: From Lemma 9.3.3 it suffices to show that

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=0}^{n-1} p(T_{A \times B}^{-i}F \cap G) = p(F)p(G)$$

for all rectangles of the form $F = F_A \times F_B$, $F_A \in \mathcal{B}_A^T$, $F_B \in \mathcal{B}_B^T$ and
$G = G_A \times G_B$. Then

$$p(T_{A \times B}^{-i}F \cap G) = p((T_A^{-i}F_A \cap G_A) \times (T_B^{-i}F_B \cap G_B)) = \mu((T_A^{-i}F_A \cap G_A) \cap f^{-1}(T_B^{-i}F_B \cap G_B)).$$

Since $f$ is stationary and since inverse images preserve set theoretic operations,

$$f^{-1}(T_B^{-i}F_B \cap G_B) = T_A^{-i}f^{-1}(F_B) \cap f^{-1}(G_B)$$

and hence

$$\begin{aligned}
\frac{1}{n}\sum_{i=0}^{n-1} p(T_{A \times B}^{-i}F \cap G)
&= \frac{1}{n}\sum_{i=0}^{n-1} \mu\big(T_A^{-i}(F_A \cap f^{-1}(F_B)) \cap G_A \cap f^{-1}(G_B)\big) \\
&\mathop{\longrightarrow}_{n \to \infty} \mu(F_A \cap f^{-1}(F_B))\,\mu(G_A \cap f^{-1}(G_B)) = p(F_A \times F_B)\,p(G_A \times G_B)
\end{aligned}$$

since $\mu$ is ergodic. This means that the rectangles meet the required condition.
Some algebra then will show that finite unions of disjoint sets meeting the
conditions also meet the conditions and that complements of sets meeting the
conditions also meet them. This implies from the good sets principle (see, for
example, p. 14 of [50]) that the field generated by the rectangles also meets the
condition and hence the lemma is proved. □
A stationary sequence coder has a simple and useful structure. Suppose one
has a mapping $f : A^T \to B$, that is, a mapping that maps an input sequence into
an output letter. We can define a complete output sequence $y$ corresponding to
an input sequence $x$ by

$$y_i = f(T_A^i x); \quad i \in T, \tag{9.7}$$

that is, we produce an output, then shift or slide the input sequence by one time
unit, and then we produce another output using the same function, and so on. A
mapping of this form is called an infinite length sliding block code because it
produces outputs by successively sliding an infinite length input sequence and each
time using a fixed mapping to produce the output. The sequence-to-letter
mapping implies a sequence coder, say $\bar f$, defined by $\bar f(x) = \{f(T_A^i x);\ i \in T\}$.
Furthermore, $\bar f(T_A x) = T_B \bar f(x)$, that is, a sliding block code induces a stationary
sequence coder. Conversely, any stationary sequence coder $\bar f$ induces a sliding
block code $f$ for which (9.7) holds by the simple identification $f(x) = (\bar f(x))_0$,
the output at time 0 of the sequence coder. Thus the ideas of stationary
sequence coders mapping sequences into sequences and sliding block codes
mapping sequences into letters by sliding the input sequence are equivalent. We can
similarly define an $(N,K)$-sliding block code which is a mapping $f : A^T \to B^K$
which forms an output sequence $y$ from an input sequence $x$ via the construction

$$y_{iK}^K = f(T_A^{iN} x).$$

By a similar argument, $(N,K)$-sliding block coders are equivalent to $(N,K)$-stationary
sequence coders. When dealing with sliding block codes we will
usually assume for simplicity that $K$ is 1. This involves no loss in generality
since it can be made true by redefining the output alphabet.

Example 9.4.4: B-processes 

The above construction using sliding block or stationary codes provides an easy 
description of an important class of random processes that has several nice 
properties. A process is said to be a B-process or Bernoulli process if it can be 
defined as a stationary coding of an independent identically distributed (i.i.d.) 
process. Let $\mu$ denote the original distribution of the i.i.d. process and let $\eta$
denote the induced output distribution. Then for any output events $F$ and $G$

$$\eta(F \cap T_B^{-n}G) = \mu(\bar f^{-1}(F \cap T_B^{-n}G)) = \mu(\bar f^{-1}(F) \cap T_A^{-n}\bar f^{-1}(G)),$$

since $\bar f$ is stationary. But $\mu$ is stationary and mixing since it is i.i.d. (see Section
6.7 of [50]) and hence this probability converges to

$$\mu(\bar f^{-1}(F))\,\mu(\bar f^{-1}(G)) = \eta(F)\eta(G)$$

and hence $\eta$ is also mixing. Thus a B-process is mixing of all orders and hence
is ergodic with respect to $T_B^n$ for all positive integers $n$.

While codes that depend on infinite input sequences may not at first glance 
seem to be a reasonable physical model of a coding system, it is possible for
such codes to depend on the infinite sequence only through a finite number of
coordinates. In addition, some real codes may indeed depend on an unboundedly 
large number of past inputs because of feedback. 

Suppose that we consider two-sided processes and that we have a measurable
mapping

$$\phi : \times_{i=-M}^{D} A_i \to B$$

and we define a sliding block code by

$$f(x) = \phi(x_{-M}, \cdots, x_0, \cdots, x_D),$$

then the sequence coder $\bar f$ induced by $f$ is stationary. The mapping $\phi$ is also called a sliding
block code or a finite-length sliding block code or a finite-window sliding block
code. $M$ is called the memory of the code and $D$ is called the delay of the code
since $M$ past source symbols and $D$ future symbols are required to produce the
current output symbol. The window length or constraint length of the code is
$M+D+1$, the number of input symbols viewed to produce an output symbol. If
$D = 0$ the code is said to be causal. If $M = 0$ the code is said to be memoryless.
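
As an illustration of the finite-window construction, the following minimal Python sketch applies a window map to a two-sided sequence and checks numerically that the induced coder commutes with the shift. The names `sliding_block_encode`, `shift`, `majority`, and the periodic test sequence are hypothetical choices made for this sketch only; they do not come from the text.

```python
def sliding_block_encode(x, phi, M, D, times):
    """Return {i: phi(x_{i-M}, ..., x_i, ..., x_{i+D})} for each time i in `times`.

    `x` is a callable int -> symbol standing in for a two-sided sequence.
    """
    return {i: phi(tuple(x(i + j) for j in range(-M, D + 1))) for i in times}


def shift(x):
    """Left shift T: (Tx)_i = x_{i+1}."""
    return lambda i: x(i + 1)


# Hypothetical two-sided input: a periodic binary sequence.
PATTERN = [0, 1, 1, 0, 1]


def x(i):
    return PATTERN[i % len(PATTERN)]


def majority(window):
    """Hypothetical window map phi for M = D = 1: majority vote of three bits."""
    return int(sum(window) >= 2)


M, D = 1, 1
y = sliding_block_encode(x, majority, M, D, range(-3, 8))
y_of_shifted = sliding_block_encode(shift(x), majority, M, D, range(-3, 7))

# Stationarity check: coding the shifted input equals shifting the coded output.
commutes_with_shift = all(y_of_shifted[i] == y[i + 1] for i in range(-3, 7))
print([y[i] for i in range(5)], commutes_with_shift)
```

Here the window length is $M+D+1 = 3$, and the check only covers a finite range of times, so it illustrates the stationarity property rather than proving it.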

There is a problem with the above model if we wish to code a one-sided
source since if we wish to start coding at time 0, there are no input symbols with
negative indices. Hence we either must require the code be memoryless ($M = 0$)
or we must redefine the code for the first $M$ instances (e.g., by "stuffing" the
code register with arbitrary symbols) or we must only define the output for times
$i \ge M$. For two-sided sources a finite-length sliding block code is stationary.
In the one-sided case it is not even defined precisely unless it is memoryless, in
which case it is stationary.

Another case of particular interest is when we have a measurable mapping
$\gamma : A^N \to B^K$ and we define a sequence coder $f(x) = y$ by

$$y_{nK}^K = (y_{nK}, y_{nK+1}, \cdots, y_{(n+1)K-1}) = \gamma(x_{nN}^N),$$

that is, the input is parsed into nonoverlapping blocks of length $N$ and each is
successively coded into a block of length $K$ outputs without regard to past or
future input or output blocks. Clearly $N$ input time units must correspond
to $K$ output time units in physical time if the code is to make sense. A code of
this form is called a block code and it is a special case of an $(N,K)$ sliding block
code. Such a code is trivially $(T_A^N, T_B^K)$-stationary.
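
The block-code construction can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function `block_encode`, the parity-appending map `gamma` with $(N,K) = (2,3)$, and the sample input are all hypothetical and chosen only to make the parsing rule concrete on a finite one-sided input.

```python
def block_encode(x, gamma, N):
    """Parse x into nonoverlapping N-blocks and code each block with gamma.

    gamma maps an N-tuple of input letters to a K-tuple of output letters.
    Trailing symbols that do not fill a complete block are ignored.
    """
    usable = len(x) - len(x) % N
    y = []
    for start in range(0, usable, N):
        y.extend(gamma(tuple(x[start:start + N])))
    return y


def gamma(block):
    """Hypothetical (N, K) = (2, 3) block map: append a parity bit."""
    return block + (sum(block) % 2,)


N, K = 2, 3
x = [0, 1, 1, 1, 0, 0, 1, 0]
y = block_encode(x, gamma, N)

# (N, K)-stationarity on this finite example: shifting the input by N symbols
# shifts the output by K symbols.
block_shift_ok = block_encode(x[N:], gamma, N) == y[K:]
# Shifting the input by a single symbol changes the parsing, so the output
# is in general not a shift of y.
single_shift_output = block_encode(x[1:], gamma, N)
print(y, block_shift_ok, single_shift_output)
```

The last line of the example shows why a block code is $(N,K)$-stationary but not, in general, stationary with respect to the single-symbol shifts.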

We now return to genuinely random channels. The next example is perhaps 
the most popular model for a noisy channel because of its simplicity. 

Example 9.4.5: Memoryless channels 

Suppose that $q_{x_0}(\cdot)$ is a probability measure on $\mathcal{B}_B$ for all $x_0 \in A$ and that for
fixed $F$, $q_{x_0}(F)$ is a measurable function of $x_0$. Let $\nu$ be a channel specified by
its values on output rectangles by

$$\nu_x(y : y_i \in F_i;\ i \in J) = \prod_{i \in J} q_{x_i}(F_i)$$

for any finite index set $J \subset T$. Then $\nu$ is said to be a memoryless channel.
Intuitively,

$$\Pr(Y_i \in F_i;\ i \in J \mid X) = \prod_{i \in J} \Pr(Y_i \in F_i \mid X_i).$$
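
For a finite alphabet the product form can be evaluated directly. The sketch below is illustrative only: the function `rectangle_probability`, the dictionary representation of $q$, and the binary symmetric example with crossover probability 0.1 are assumptions of this sketch, not part of the text.

```python
import math


def rectangle_probability(q, x, events):
    """Probability of the output rectangle {y : y_i in F_i, i in J} given input x.

    q[a][b] is the single-letter probability of output b given input letter a,
    x maps a time index to an input letter, and events maps each i in J to F_i.
    """
    return math.prod(sum(q[x[i]][b] for b in F_i) for i, F_i in events.items())


# Hypothetical single-letter kernel: binary symmetric with crossover 0.1.
q = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
x = {0: 0, 1: 1, 2: 1}

p_both = rectangle_probability(q, x, {0: {1}, 1: {1}})        # q_0({1}) * q_1({1})
p_trivial = rectangle_probability(q, x, {0: {0}, 2: {0, 1}})  # q_0({0}) * 1
print(p_both, p_trivial)
```

Only the input letters at the times in $J$ enter the computation, which is the sense in which the channel has no memory and no anticipation.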

For later use we pause to develop a useful inequality for mutual information 
between the input and output of a memoryless channel. For contrast we also 
describe the corresponding result for a memoryless source and an arbitrary 
channel. 

Lemma 9.4.2: Let $\{X_n\}$ be a source with distribution $\mu$ and let $\nu$ be a
channel. Let $\{X_n, Y_n\}$ be the hookup with distribution $p$. If the channel is
memoryless, then for any $n$

$$I(X^n; Y^n) \le \sum_{i=0}^{n-1} I(X_i; Y_i).$$

If instead the source is memoryless, then the inequality is reversed:

$$I(X^n; Y^n) \ge \sum_{i=0}^{n-1} I(X_i; Y_i).$$

Thus if both source and channel are memoryless,

$$I(X^n; Y^n) = \sum_{i=0}^{n-1} I(X_i; Y_i).$$

Proof: First suppose that the process is discrete. Then

$$I(X^n; Y^n) = H(Y^n) - H(Y^n|X^n).$$

Since by construction

$$p_{Y^n|X^n}(y^n|x^n) = \prod_{i=0}^{n-1} p_{Y_0|X_0}(y_i|x_i)$$

an easy computation shows that

$$H(Y^n|X^n) = \sum_{i=0}^{n-1} H(Y_i|X_i).$$

This combined with the inequality

$$H(Y^n) \le \sum_{i=0}^{n-1} H(Y_i)$$

(Lemma 2.3.2 used several times) completes the proof of the memoryless channel
result for finite alphabets. If instead the source is memoryless, we have

$$I(X^n; Y^n) = H(X^n) - H(X^n|Y^n) = \sum_{i=0}^{n-1} H(X_i) - H(X^n|Y^n).$$

Extending Lemma 2.3.2 to conditional entropy yields

$$H(X^n|Y^n) \le \sum_{i=0}^{n-1} H(X_i|Y^n)$$

which can be further overbounded by using Lemma 2.5.2 (the fact that reducing
conditioning increases conditional entropy) as

$$H(X^n|Y^n) \le \sum_{i=0}^{n-1} H(X_i|Y_i)$$

which implies that

$$I(X^n; Y^n) \ge \sum_{i=0}^{n-1} \big(H(X_i) - H(X_i|Y_i)\big) = \sum_{i=0}^{n-1} I(X_i; Y_i),$$

which completes the proof for finite alphabets.

To extend the result to standard alphabets, first consider the case where the
$Y^n$ are quantized to a finite alphabet. If the $Y_k$ are conditionally independent
given $X^n$, then the same is true for $q(Y_k)$, $k = 0, 1, \cdots, n-1$. Lemma 5.5.6 then
implies that as in the discrete case, $I(X^n; Y^n) = H(Y^n) - H(Y^n|X^n)$ and the
remainder of the proof follows as in the discrete case. Letting the quantizers
become asymptotically accurate then completes the proof. □
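
The three cases of Lemma 9.4.2 can be checked numerically for $n = 2$ and binary alphabets. The following Python sketch is only an illustration under stated assumptions: the crossover probability `EPS`, the two-letter Markov source, and the "common noise" channel with memory are hypothetical examples introduced here, not taken from the text.

```python
import itertools
import math


def mutual_information(joint):
    """I(U;V) in bits from a joint pmf given as a dict {(u, v): prob}."""
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    return sum(p * math.log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)


def hookup(source, channel):
    """Joint pmf of (x^n, y^n); channel(x^n) returns the pmf of y^n given x^n."""
    return {(x, y): px * pyx
            for x, px in source.items()
            for y, pyx in channel(x).items()}


def compare(source, channel, n):
    """Return (I(X^n;Y^n), sum over i of I(X_i;Y_i))."""
    joint = hookup(source, channel)
    letters = 0.0
    for i in range(n):
        pair = {}
        for (x, y), p in joint.items():
            pair[(x[i], y[i])] = pair.get((x[i], y[i]), 0.0) + p
        letters += mutual_information(pair)
    return mutual_information(joint), letters


EPS = 0.1  # hypothetical crossover probability
n = 2
words = list(itertools.product((0, 1), repeat=n))


def memoryless_bsc(x):
    """Memoryless channel: each letter is flipped independently with prob. EPS."""
    return {y: math.prod(EPS if a != b else 1 - EPS for a, b in zip(x, y))
            for y in words}


def common_noise_channel(x):
    """Channel with memory: a single noise bit flips every letter of the block."""
    return {x: 1 - EPS, tuple(1 - a for a in x): EPS}


iid_source = {x: 0.25 for x in words}
markov_source = {x: 0.5 * (0.9 if x[0] == x[1] else 0.1) for x in words}

case_channel = compare(markov_source, memoryless_bsc, n)     # block <= letters
case_source = compare(iid_source, common_noise_channel, n)   # block >= letters
case_both = compare(iid_source, memoryless_bsc, n)           # block == letters
print(case_channel, case_source, case_both)
```

Each pair printed is (block mutual information, sum of single-letter mutual informations), so the three pairs exhibit the inequality, the reversed inequality, and equality, respectively.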

In fact two forms of memorylessness are evident in a memoryless channel.
The channel is input memoryless in that the probability of an output event
involving $\{Y_i;\ i \in \{k, k+1, \cdots, m\}\}$ does not involve any inputs before time $k$,
that is, the past inputs. The channel is also input nonanticipatory since this
event does not depend on inputs after time $m$, that is, the future inputs. The
channel is also output memoryless in the sense that for any given input $x$, output
events involving nonoverlapping times are independent, i.e.,

$$\nu_x(Y_1 \in F_1 \cap Y_2 \in F_2) = \nu_x(Y_1 \in F_1)\,\nu_x(Y_2 \in F_2).$$

We pin down these ideas in the following examples.

Example 9.4.6: Channels with finite input memory and anticipation

A channel $\nu$ is said to have finite input memory of order $M$ if for all one-sided
events $F$ and all $n$

$$\nu_x((Y_n, Y_{n+1}, \cdots) \in F) = \nu_{x'}((Y_n, Y_{n+1}, \cdots) \in F)$$

whenever $x_i = x'_i$ for $i \ge n - M$. In other words, for an event involving $Y_i$'s
after some time $n$, knowing only the inputs for the same times and $M$ time
units earlier completely determines the output probability. Channels with finite
input memory were introduced by Feinstein [40]. Similarly $\nu$ is said to have
finite anticipation of order $L$ if for all one-sided events $F$ and all $n$

$$\nu_x((\cdots, Y_n) \in F) = \nu_{x'}((\cdots, Y_n) \in F)$$

provided $x'_i = x_i$ for $i \le n + L$. That is, at most $L$ future inputs must be known
to determine the probability of an event involving current and past outputs.

Example 9.4.7: Channels with finite output memory 

A channel $\nu$ is said to have finite output memory of order $K$ if for all one-sided
events $F$ and $G$ and all inputs $x$, if $k > K$ then

$$\nu_x((\cdots, Y_n) \in F \cap (Y_{n+k}, \cdots) \in G) = \nu_x((\cdots, Y_n) \in F)\,\nu_x((Y_{n+k}, \cdots) \in G);$$

that is, output events involving output samples separated by more than $K$ time
units are independent. Channels with finite output memory were introduced by
Wolfowitz [150].

Channels with finite memory and anticipation are historically important as 
the first real generalizations of memoryless channels for which coding theorems 
could be proved. Furthermore, the assumption of finite anticipation is physi- 
cally reasonable as a model for real-world communication channels. The finite 
memory assumptions, however, exclude many important examples, e.g., finite- 
state or Markov channels and channels with feedback filtering action. Hence 
we will emphasize more general notions which can be viewed as approximations 
or asymptotic versions of the finite memory assumption. The generalization of 
finite input memory channels requires some additional tools and is postponed 
to the next chapter. The notion of finite output memory can be generalized by 
using the notion of mixing. 

Example 9.4.8: Output mixing channels 

A channel is said to be output mixing (or asymptotically output independent
or asymptotically output memoryless) if for all output rectangles $F$ and $G$ and
all input sequences $x$

$$\lim_{n \to \infty} |\nu_x(T^{-n}F \cap G) - \nu_x(T^{-n}F)\nu_x(G)| = 0.$$

More generally it is said to be output weakly mixing if

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=0}^{n-1} |\nu_x(T^{-i}F \cap G) - \nu_x(T^{-i}F)\nu_x(G)| = 0.$$

Unlike mixing systems, the above definitions for channels place conditions only
on output rectangles and not on all output events. Output mixing channels
were introduced by Adler [2].

The principal property of output mixing channels is provided by the following 
lemma. 

Lemma 9.4.3: If a channel is stationary and output weakly mixing, then
it is also ergodic. That is, if $\nu$ is stationary and output weakly mixing and if $\mu$
is stationary and ergodic, then also $\mu\nu$ is stationary and ergodic.

Proof: The process $\mu\nu$ is stationary by Lemma 9.3.1. To prove that it is
ergodic it suffices from Lemma 9.3.3 to prove that for all input/output rectangles
of the form $F = F_A \times F_B$, $F_A \in \mathcal{B}_A^T$, $F_B \in \mathcal{B}_B^T$, and $G = G_A \times G_B$ that

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=0}^{n-1} \mu\nu(T^{-i}F \cap G) = \mu\nu(F)\,\mu\nu(G).$$

We have that

$$\begin{aligned}
&\frac{1}{n}\sum_{i=0}^{n-1} \mu\nu(T^{-i}F \cap G) - \mu\nu(F)\mu\nu(G) \\
&\quad = \frac{1}{n}\sum_{i=0}^{n-1} \mu\nu\big((T_A^{-i}F_A \cap G_A) \times (T_B^{-i}F_B \cap G_B)\big) - \mu\nu(F_A \times F_B)\,\mu\nu(G_A \times G_B) \\
&\quad = \frac{1}{n}\sum_{i=0}^{n-1} \int_{T_A^{-i}F_A \cap G_A} d\mu(x)\, \nu_x(T_B^{-i}F_B \cap G_B) - \mu\nu(F_A \times F_B)\,\mu\nu(G_A \times G_B) \\
&\quad = \left(\frac{1}{n}\sum_{i=0}^{n-1} \Big(\int_{T_A^{-i}F_A \cap G_A} d\mu(x)\, \nu_x(T_B^{-i}F_B \cap G_B) - \int_{T_A^{-i}F_A \cap G_A} d\mu(x)\, \nu_x(T_B^{-i}F_B)\nu_x(G_B)\Big)\right) \\
&\qquad + \left(\frac{1}{n}\sum_{i=0}^{n-1} \int_{T_A^{-i}F_A \cap G_A} d\mu(x)\, \nu_x(T_B^{-i}F_B)\nu_x(G_B) - \mu\nu(F_A \times F_B)\,\mu\nu(G_A \times G_B)\right).
\end{aligned}$$

The first term is bound above by

$$\begin{aligned}
&\frac{1}{n}\sum_{i=0}^{n-1} \int_{T_A^{-i}F_A \cap G_A} d\mu(x)\, |\nu_x(T_B^{-i}F_B \cap G_B) - \nu_x(T_B^{-i}F_B)\nu_x(G_B)| \\
&\quad \le \int d\mu(x)\, \frac{1}{n}\sum_{i=0}^{n-1} |\nu_x(T_B^{-i}F_B \cap G_B) - \nu_x(T_B^{-i}F_B)\nu_x(G_B)|
\end{aligned}$$

which goes to zero from the dominated convergence theorem since the integrand
converges to zero from the output weakly mixing assumption. The second term
can be expressed using the stationarity of the channel as

$$\int_{G_A} d\mu(x)\, \nu_x(G_B)\, \frac{1}{n}\sum_{i=0}^{n-1} 1_{F_A}(T_A^i x)\,\nu_{T_A^i x}(F_B) - \mu\nu(F)\,\mu\nu(G).$$

The ergodic theorem implies that as $n \to \infty$ the sample average goes to its
expectation

$$\int d\mu(x)\, 1_{F_A}(x)\,\nu_x(F_B) = \mu\nu(F)$$

and hence the above formula converges to 0, proving the lemma. □

The lemma provides an example of a completely random channel that is also 
ergodic in the following corollary. 

Corollary 9.4.1: Suppose that $\nu$ is a stationary completely random channel
described by an output measure $\eta$. If $\eta$ is weakly mixing, then $\nu$ is ergodic. That
is, if $\mu$ is stationary and ergodic and $\eta$ is stationary and weakly mixing, then
$\mu\nu = \mu \times \eta$ is stationary and ergodic.

Proof: If $\eta$ is weakly mixing, then the channel $\nu$ defined by $\nu_x(F) = \eta(F)$,
all $x \in A^T$, $F \in \mathcal{B}_B^T$ is output weakly mixing. Thus ergodicity follows from
the lemma. □

The idea of a memoryless channel can be extended to a block memoryless 
or block independent channel, as described next. 

Example 9.4.9: Block Memoryless Channels 

Suppose now that we have integers $N$ and $K$ (usually $K = N$) and a probability
measure $q_{x^N}(\cdot)$ on $\mathcal{B}_B^K$ for each $x^N \in A^N$ such that $q_{x^N}(F)$ is a measurable
function of $x^N$ for each $F \in \mathcal{B}_B^K$. Let $\nu$ be specified by its values on output
rectangles by

$$\nu_x(y : y_i \in F_i;\ i = m, \cdots, m+n-1) = \prod_{i=0}^{\lfloor n/K \rfloor} q_{x_{iN}^N}(G_i),$$

where $F_i \in \mathcal{B}_B$, all $i$, where $\lfloor z \rfloor$ is the largest integer contained in $z$, and where

$$G_i = \times_{j=m+iK}^{m+(i+1)K-1} F_j$$

with $F_j = B$ if $j \ge m + n$. Such channels are called block memoryless channels
or block independent channels. They are a special case of the following class of
channels.

Example 9.4.10: Conditionally Block Independent Channels

A conditionally block independent or CBI channel resembles the block memoryless
channel in that for a given input sequence the outputs are block independent.
It is more general, however, in that the conditional probabilities of the output
block may depend on the entire input sequence (or at least on parts of the input
sequence not in the same time block). Thus a channel is CBI if its values on
output rectangles satisfy

$$\nu_x(y : y_i \in F_i;\ i = m, \cdots, m+n-1) = \prod_{i=0}^{\lfloor n/K \rfloor} \nu_x(y : y_{iN}^N \in G_i),$$

where as before

$$G_i = \times_{j=m+iK}^{m+(i+1)K-1} F_j$$

with $F_j = B$ if $j \ge m + n$. Block memoryless channels are clearly a special
case of CBI channels. These channels have only finite output memory, but
unlike the block memoryless channels they need not have finite input memory
or anticipation.

The primary use of block memoryless channels is in the construction of a
channel given finite-dimensional conditional probabilities, that is, one has
probabilities for output $K$-tuples given input $N$-tuples and one wishes to model a
channel consistent with these finite-dimensional distributions. The finite-dimensional
distributions themselves may be the result of an optimization problem or
an estimate based on observed behavior. An immediate problem is that a channel
constructed in this manner may not be stationary, although it is clearly
$(N,K)$-stationary. The next example shows how to modify a block memoryless
channel so as to produce a stationary channel. The basic idea is to occasionally
insert some random spacing between the blocks so as to "stationarize" the
channel.

Before turning to the example we first develop the technical details required 
for producing such random spacing. 

Random Punctuation Sequences 

We demonstrate that we can obtain a sequence with certain properties by sta- 
tionary coding of an arbitrary stationary and ergodic process. The lemma is a 
variant of a theorem of Shields and Neuhoff [133] as simplified by Neuhoff and 
Gilbert [108] for sliding block codings of finite alphabet processes. One of the 
uses to which the result will be put is the same as theirs: constructing sliding 
block codes from block codes. 

Lemma 9.4.4: Suppose that $\{X_n\}$ is a stationary and ergodic process.
Then given $N$ and $\delta > 0$ there exists a stationary (or sliding block) coding
$f : A^T \to \{0, 1, 2\}$ yielding a ternary process $\{Z_n\}$ with the following properties:

(a) $\{Z_n\}$ is stationary and ergodic.

(b) $\{Z_n\}$ has a ternary alphabet $\{0, 1, 2\}$ and it can output only $N$-cells of the
form $011 \cdots 1$ (0 followed by $N-1$ ones) or individual 2's. In particular,
each 0 is always followed by exactly $N-1$ 1's.

(c) For all integers $k$

$$\frac{1-\delta}{N} \le \Pr(Z_k^N = 011 \cdots 1) \le \frac{1}{N}$$

and hence for any $n$

$$\Pr(Z_n \text{ is in an } N\text{-cell}) \ge 1 - \delta.$$

A process $\{Z_n\}$ with these properties is called an $(N,\delta)$-random blocking
process or punctuation sequence $\{Z_n\}$.

Proof: A sliding block coding is stationary and hence coding a stationary
and ergodic process will yield a stationary and ergodic process (Lemma 9.4.1)
which proves the first part. Pick an $\epsilon > 0$ such that $\epsilon N < \delta$. Given the
stationary and ergodic process $\{X_n\}$ (that is also assumed to be aperiodic in
the sense that it does not place all of its probability on a finite set of sequences)
we can find an event $G \in \mathcal{B}_A^T$ having probability less than $\epsilon$. Consider the
event $F = G - \bigcup_{i=1}^{N-1} T^{-i}G$, that is, $F$ is the collection of sequences $x$ for which
$x \in G$, but $T^i x \notin G$ for $i = 1, \cdots, N-1$. We next develop several properties of
this set.

First observe that obviously $\mu(F) \le \mu(G)$ and hence

$$\mu(F) \le \epsilon.$$

The sequence of sets $T^{-i}F$ are disjoint since if $y \in T^{-i}F$, then $T^i y \in F \subset G$
and $T^{i+l} y \notin G$ for $l = 1, \cdots, N-1$, which means that $T^j y \notin G$ and hence
$T^j y \notin F$ for $N-1 \ge j > i$. Lastly we need to show that although $F$ may have
small probability, it is not 0. To see this suppose the contrary, that is, suppose
that $\mu(G - \bigcup_{i=1}^{N-1} T^{-i}G) = 0$. Then

$$\mu\Big(G \cap \Big(\bigcup_{i=1}^{N-1} T^{-i}G\Big)\Big) = \mu(G) - \mu\Big(G \cap \Big(\bigcup_{i=1}^{N-1} T^{-i}G\Big)^c\Big) = \mu(G)$$

and hence $\mu(\bigcup_{i=1}^{N-1} T^{-i}G \mid G) = 1$. In words, if $G$ occurs, then it is certain to
occur again within the next $N$ shifts. This means that with probability 1 the
relative frequency of $G$ in a sequence $x$ must be no less than $1/N$ since if it
ever occurs (which it must with probability 1), it must thereafter occur at least
once every $N$ shifts. This is a contradiction, however, since this means from the
ergodic theorem that $\mu(G) \ge 1/N$ when it was assumed that $\mu(G) \le \epsilon < 1/N$.
Thus it must hold that $\mu(F) > 0$.

We now use the rare event $F$ to define a sliding block code. The general
idea is simple, but a more complicated detail will be required to handle a special
case. Given a sequence $x$, define $n(x)$ to be the smallest $i$ for which $T^i x \in F$;
that is, we look into the future to find the next occurrence of $F$. Since $F$ has
nonzero probability, $n(x)$ will be finite with probability 1. Intuitively, $n(x)$
should usually be large since $F$ has small probability. Once $F$ is found, we code
backwards from that point using blocks of a 0 prefix followed by $N-1$ 1's. The
appropriate symbol is then the output of the sliding block code. More precisely,
if $n(x) = kN + l$, then the sliding block code prints a 0 if $l = 0$ and prints a
1 otherwise. This idea suffices until the event $F$ actually occurs at the present
time, that is, when $n(x) = 0$. At this point the sliding block code has just
completed printing an $N$-cell of $0111 \cdots 1$. It should not automatically start a
new $N$-cell, because at the next shift it will be looking for a new $F$ in the future
to code back from and the new cells may not align with the old cells. Thus
the coder looks into the future for the next $F$; that is, it again seeks $n(x)$, the
smallest $i > 0$ for which $T^i x \in F$. This time $n(x)$ must be greater than or equal to
$N$ since $x$ is now in $F$ and $T^{-i}F$ are disjoint for $i = 1, \cdots, N-1$. After finding
$n(x) = kN + l$, the coder again codes back to the origin of time. If $l = 0$, then
the two codes are aligned and the coder prints a 0 and continues as before. If
$l \ne 0$, then the two codes are not aligned, that is, the current time is in the
middle of a new code word. By construction $l \le N-1$. In this case the coder
prints $l$ 2's (filler symbols) and shifts the input sequence $l$ times. At this point
$n(x) = kN$ for some $k$, with $T^{n(x)}x \in F$, and the coding can proceed as
before. Note that $k$ is at least one, that is, there is at least one complete cell
before encountering the new $F$.

By construction, 2's can occur only following the event $F$ and then no more
than $N$ 2's can be produced. Thus from the ergodic theorem the relative
frequency of 2's (and hence the probability that $Z_n$ is not in an $N$-block) is no
greater than

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=0}^{n-1} 1_2(Z_0(T^i x)) \le \lim_{n \to \infty} \frac{1}{n}\sum_{i=0}^{n-1} 1_F(T^i x)\,N = N\mu(F) \le N\frac{\delta}{N} = \delta, \tag{9.8}$$

that is,

$$\Pr(Z_n \text{ is in an } N\text{-cell}) \ge 1 - \delta.$$

Since $Z_n$ is stationary by construction,

$$\Pr(Z_k^N = 011 \cdots 1) = \Pr(Z_0^N = 011 \cdots 1) \text{ for all } k.$$

Thus

$$\Pr(Z_0^N = 011 \cdots 1) = \frac{1}{N}\sum_{k=0}^{N-1} \Pr(Z_k^N = 011 \cdots 1).$$

The events $\{Z_k^N = 011 \cdots 1\}$, $k = 0, 1, \cdots, N-1$ are disjoint, however, since
there can be at most one 0 in a single block of $N$ symbols. Thus

$$N\Pr(Z_0^N = 011 \cdots 1) = \sum_{k=0}^{N-1} \Pr(Z_k^N = 011 \cdots 1) = \Pr\Big(\bigcup_{k=0}^{N-1} \{Z_k^N = 011 \cdots 1\}\Big). \tag{9.9}$$

Thus since the rightmost probability is between $1 - \delta$ and 1,

$$\frac{1}{N} \ge \Pr(Z_0^N = 011 \cdots 1) \ge \frac{1-\delta}{N}$$

which completes the proof. □
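
The coding rule used in the proof can be sketched on a finite stretch of time. The Python sketch below is a simplification made for illustration: instead of a sliding block code looking into an infinite future, it takes a given list of occurrence times of the rare event $F$ (assumed at least $N$ apart) and fills in the punctuation symbols between consecutive occurrences. The names `punctuate` and `cells_are_well_formed` and the sample occurrence times are hypothetical.

```python
def punctuate(f_times, N):
    """Code the times from f_times[0] up to (not including) f_times[-1].

    `f_times` lists, in increasing order, the times at which the rare event F
    occurs; consecutive occurrences must be at least N apart.
    """
    z = []
    for a, b in zip(f_times, f_times[1:]):
        k, l = divmod(b - a, N)  # gap to the next F is k*N + l
        if k < 1:
            raise ValueError("occurrences of F must be at least N apart")
        z.extend([2] * l)                    # l filler symbols right after F
        z.extend(([0] + [1] * (N - 1)) * k)  # then k complete N-cells up to the next F
    return z


def cells_are_well_formed(z, N):
    """True if z consists only of 2's and complete cells 0 1 ... 1 of length N."""
    i = 0
    while i < len(z):
        if z[i] == 2:
            i += 1
        elif z[i] == 0 and z[i + 1:i + N] == [1] * (N - 1):
            i += N
        else:
            return False
    return True


N = 3
f_times = [0, 7, 13, 22]  # hypothetical occurrence times of F
z = punctuate(f_times, N)
fraction_of_twos = z.count(2) / len(z)
print(z, cells_are_well_formed(z, N), fraction_of_twos)
```

Each occurrence of $F$ contributes at most $N-1$ filler symbols in this sketch, which mirrors the bound in the proof that the relative frequency of 2's is at most $N$ times the relative frequency of $F$.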

The following corollary shows that a finite length sliding block code can be 
used in the lemma. 

Corollary 9.4.2: Given the assumptions of the lemma, a finite-window 
sliding block code exists with properties (a)-(c). 

Proof: The sets $G$ and hence also $F$ can be chosen in the proof of the
lemma to be finite dimensional, that is, to be measurable with respect to
$\sigma(X_{-K}, \ldots, X_K)$ for some sufficiently large $K$. Choose these sets as before
with $\delta/2$ replacing $\delta$. Define $n(x)$ as in the proof of the lemma. Since $n(x)$ is
finite with probability one, there must be an $L$ such that if

$$B_L = \{x : n(x) > L\},$$

then

$$\mu(B_L) \le \frac{\delta}{2}.$$

Modify the construction of the lemma so that if $n(x) > L$, then the sliding block
code prints a 2. Thus if there is no occurrence of the desired finite dimensional
pattern in a huge bunch of future symbols, a 2 is produced. If $n(x) \le L$, then $f$
is chosen as in the proof of the lemma. The proof now proceeds as in the lemma
until (9.8), which is replaced by

$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} 1_2(Z_0(T^i x)) \le \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} 1_{B_L}(T^i x) + \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} 1_F(T^i x)\,N \le \frac{\delta}{2} + \frac{\delta}{2} = \delta.$$

The remainder of the proof is the same. □

Application of the lemma to an i.i.d. source and merging the symbols 1 and 
2 in the punctuation process immediately yield the following result since coding 
an i.i.d. process yields a B-process which is therefore mixing. 

Corollary 9.4.3: Given an integer $N$ and a $\delta > 0$ there exists an
$(N, \delta)$-punctuation sequence $\{Z_n\}$ with the following properties:

(a) $\{Z_n\}$ is stationary and mixing (and hence ergodic).

(b) $\{Z_n\}$ has a binary alphabet $\{0, 1\}$ and it can output only $N$-cells of the
form $011\cdots1$ (0 followed by $N - 1$ ones) or individual ones, that is, each
zero is always followed by at least $N - 1$ ones.

(c) For all integers $k$

$$\frac{1-\delta}{N} \le \Pr(Z_k^N = 011\cdots1) \le \frac{1}{N}$$

and hence for any $n$

$$\Pr(Z_n \text{ is in an } N\text{-cell}) \ge 1 - \delta.$$

Example 9.4.11: Stationarized Block Memoryless Channel

Intuitively, a stationarized block memoryless (SBM) channel is a block
memoryless channel with random spacing inserted between the blocks according to a
random punctuation process. That is, when the random blocking process
produces $N$-cells (which is most of the time), the channel uses the $N$-dimensional
conditional distribution. When it is not using an $N$-cell, the channel produces
some arbitrary symbol in its output alphabet. We now make this idea precise.
Let $N$, $K$, and $q_{x^N}(\cdot)$ be as in the previous example. We now assume that
$K = N$, that is, one output symbol is produced for every input symbol and
hence output blocks have the same number of symbols as input blocks. Given
$\delta > 0$ let $\gamma$ denote the distribution of an $(N, \delta)$-random blocking sequence $\{Z_n\}$.
Let $\mu \times \gamma$ denote the product distribution on
$(A^{\mathcal{T}} \times \{0,1\}^{\mathcal{T}}, \mathcal{B}_A^{\mathcal{T}} \times \mathcal{B}_{\{0,1\}}^{\mathcal{T}})$; that is,
$\mu \times \gamma$ is the distribution of the pair process $\{X_n, Z_n\}$ consisting of the original
source $\{X_n\}$ and the random blocking source $\{Z_n\}$ with the two sources being
independent of one another. Define a regular conditional probability (and hence
a channel) $\pi_{x,z}(F)$, $F \in \mathcal{B}_B^{\mathcal{T}}$, $x \in A^{\mathcal{T}}$, $z \in \{0,1\}^{\mathcal{T}}$ by its values on rectangles
as follows: Given $z$, let $J_0(z)$ denote the collection of indices $i$ for which $z_i$ is
not in an $N$-cell and let $J_1(z)$ denote those indices $i$ for which $z_i = 0$, that
is, those indices where $N$-cells begin. Let $q^*$ denote a trivial probability mass
function on $B$ placing all of its probability on a reference letter $b^*$. Given an
output rectangle

$$F = \{y : y_j \in F_j;\ j \in J\} = \times_{j \in J} F_j,$$

define

$$\pi_{x,z}(F) = \prod_{i \in J_0(z)} q^*(F_i) \prod_{i \in J_1(z)} q_{x_i^N}\left(\times_{j=i}^{i+N-1} F_j\right),$$

where we assume that $F_i = B$ if $i \notin J$. Connecting the product source $\mu \times \gamma$
to the channel $\pi$ yields a hookup process $\{X_n, Z_n, Y_n\}$ with distribution, say,
$r$, which in turn induces a distribution $p$ on the pair process $\{X_n, Y_n\}$ having
distribution $\mu$ on $\{X_n\}$. If the alphabets are standard, $p$ also induces a regular
conditional probability for $Y$ given $X$ and hence a channel $\nu$ for which $p = \mu\nu$.
A channel of this form is said to be an $(N, \delta)$-stationarized block memoryless or
SBM channel.
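
The action of the channel $\pi$ on a single pair of input and punctuation sequences
can be sketched in Python for finite windows. This is a minimal sketch under
stated assumptions: the names `sbm_channel` and `noisy_block`, the binary
alphabets, and the particular block channels are hypothetical choices made for
illustration, not constructions from the text.

```python
import random


def sbm_channel(x, z, N, block_channel, b_star, rng):
    """Pass x through a block channel wherever z marks an N-cell.

    Positions of z outside any N-cell produce the reference letter b_star.
    """
    y = [b_star] * len(x)
    cell = [0] + [1] * (N - 1)
    i = 0
    while i < len(x):
        if z[i:i + N] == cell:
            # An N-cell begins here: use the N-dimensional block channel.
            y[i:i + N] = block_channel(x[i:i + N], rng)
            i += N
        else:
            # Not in an N-cell: the output stays at the reference letter.
            i += 1
    return y


def noisy_block(block, rng, flip_prob=0.1):
    """Hypothetical block channel: flip each bit independently."""
    return [b ^ (rng.random() < flip_prob) for b in block]


def identity_block(block, rng):
    """Hypothetical noiseless block channel (a trivial block code)."""
    return list(block)


rng = random.Random(0)
N = 3
z = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]   # the 1 at index 3 is outside every cell
x = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]

y_clean = sbm_channel(x, z, N, identity_block, b_star=0, rng=rng)
y_noisy = sbm_channel(x, z, N, noisy_block, b_star=0, rng=rng)
print(y_clean)
```

With the noiseless block channel the output copies the input inside the three
cells and shows the reference letter at index 3, the only position outside a cell.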

Lemma 9.4.5: An SBM channel is stationary and ergodic. Thus if a stationary
(and ergodic) source $\mu$ is connected to an SBM channel $\nu$, then the output is
stationary (and ergodic).

Proof: The product source $\mu \times \gamma$ is stationary and the channel $\pi$ is stationary,
hence so is the hookup $(\mu \times \gamma)\pi$ or $\{X_n, Z_n, Y_n\}$. Thus the pair process $\{X_n, Y_n\}$
must also be stationary as claimed. The product source $\mu \times \gamma$ is ergodic from
Corollary 9.4.1 since it can be considered as the input/output process of a
completely random channel described by a mixing (hence also weakly mixing)
output measure. The channel $\pi$ is output strongly mixing by construction and
hence is ergodic from Lemma 9.4.1. Thus the hookup $(\mu \times \gamma)\pi$ must be ergodic.
This implies that the coordinate process $\{X_n, Y_n\}$ must also be ergodic. This
completes the proof. □

The block memoryless and SBM channels are principally useful for proving 
theorems relating finite-dimensional behavior to sequence behavior and for sim- 
ulating channels with specified finite dimensional behavior. The SBM channels 
will also play a key role in deriving sliding block coding theorems from block 
coding theorems by replacing the block distributions by trivial distributions, 
i.e., by finite-dimensional deterministic mappings or block codes. 

The SBM channel was introduced by Pursley and Davisson [29] for finite
alphabet channels and further developed by Gray and Saadat [61], who called it 
a randomly blocked conditionally independent (RBCI) channel. We opt for the 
alternative name because these channels resemble block memoryless channels 
more than CBI channels. 

We now consider some examples that provide useful models for real-world 
channels. 

Example 9.4.12: Primitive Channels 

Primitive channels were introduced by Neuhoff and Shields [113], [110] as a phys- 
ically motivated general channel model. The idea is that most physical channels 
combine the input process with a separate noise process that is independent of 
the signal and then filter the combination in a stationary fashion. The noise 
is assumed to be i.i.d. since the filtering can introduce dependence. The
construction of such channels strongly resembles that of the SBM channels. Let $\gamma$
be the distribution of an i.i.d. process $\{Z_n\}$ with alphabet $W$, let $\mu \times \gamma$
denote the product source formed by an independent joining of the original source
distribution $\mu$ and the noise process $Z_n$, let $\pi$ denote the deterministic channel
induced by a stationary sequence coder $f : A^{\mathcal{T}} \times W^{\mathcal{T}} \to B^{\mathcal{T}}$ mapping an
input sequence and a noise sequence into an output sequence. Let $r = (\mu \times \gamma)\pi$
denote the resulting hookup distribution and $\{X_n, Z_n, Y_n\}$ denote the resulting
process. Let $p$ denote the induced distribution for the pair process $\{X_n, Y_n\}$.
If the alphabets are standard, then $p$ and $\mu$ together induce a channel $\nu_x(F)$,
$x \in A^{\mathcal{T}}$, $F \in \mathcal{B}_B^{\mathcal{T}}$. A channel of this form is called a primitive channel.

Lemma 9.4.6: A primitive channel is stationary with respect to any
stationary source and it is ergodic. Thus if $\mu$ is stationary and ergodic and $\nu$ is
primitive, then $\mu\nu$ is stationary and ergodic.

Proof: Since $\mu$ is stationary and ergodic and $\gamma$ is i.i.d. and hence mixing,
$\mu \times \gamma$ is stationary and ergodic from Corollary 9.4.1. Since the deterministic
channel is stationary, it is also ergodic from Lemma 9.4.1 and the resulting
triple $\{X_n, Z_n, Y_n\}$ is stationary and ergodic. This implies that the component
process $\{X_n, Y_n\}$ must also be stationary and ergodic, completing the proof. □

Example 9.4.13: Additive Noise Channels 

Suppose that $\{X_n\}$ is a source with distribution $\mu$ and that $\{W_n\}$ is a “noise”
process with distribution $\gamma$. Let $\{X_n, W_n\}$ denote the induced product source,
that is, the source with distribution $\mu \times \gamma$ so that the two processes are
independent. Suppose that the two processes take values in a common alphabet $A$ and
that $A$ has an addition operation $+$, e.g., it is a semi-group. Define the sliding
block code $f$ by $f(x, w) = x_0 + w_0$ and let $\bar{f}$ denote the corresponding sequence
coder. Then as in the primitive channels we have an induced distribution $r$ on
triples $\{X_n, W_n, Y_n\}$ and hence a distribution on pairs $\{X_n, Y_n\}$ which with $\mu$
induces a channel $\nu$ if the alphabets are standard. A channel of this form is
called an additive noise channel or a signal-independent additive noise channel.
If the noise process is a B-process, then this is easily seen to be a special case
of a primitive channel and hence the channel is stationary with respect to any
stationary source and ergodic. If the noise is only known to be stationary, the
channel is still stationary with respect to any stationary source. Unless the
noise is assumed to be at least weakly mixing, however, it is not known if the
channel is ergodic in general.
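
As a concrete illustration of the sliding block code $f(x, w) = x_0 + w_0$, the
following Python sketch applies it along finite windows of a source and a noise
sequence. The function names, the binary alphabet with addition modulo 2, and
the noise probability 0.1 are assumptions chosen for the example.

```python
import random


def additive_noise_channel(x, w, add):
    """Apply the time-zero code f(x, w) = x_0 + w_0 at every time index."""
    return [add(xn, wn) for xn, wn in zip(x, w)]


def add_mod2(a, b):
    """Addition on the binary alphabet {0, 1} (an assumed example group)."""
    return (a + b) % 2


rng = random.Random(1)
x = [rng.randrange(2) for _ in range(1000)]           # source samples
w = [int(rng.random() < 0.1) for _ in range(1000)]    # i.i.d. noise, P(1) = 0.1
y = additive_noise_channel(x, w, add_mod2)

# Because the noise is added modulo 2, adding it again recovers the input.
assert additive_noise_channel(y, w, add_mod2) == x
```

Here the noise is i.i.d. and hence a B-process, so this toy channel falls under
the primitive channel case described above.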

Example 9.4.14: Markov Channels 

We now consider a special case where $A$ and $B$ are finite sets with the same
number of symbols. For a fixed positive integer $K$, let $\mathbb{P}$ denote the space
of all $K \times K$ stochastic matrices $P = \{P(i,j);\ i, j = 1, 2, \ldots, K\}$. Using the
Euclidean metric on this space we can construct the Borel field $\mathcal{P}$ of subsets of
$\mathbb{P}$ generated by the open sets to form a measurable space $(\mathbb{P}, \mathcal{P})$. This, in turn,
gives a one-sided or two-sided sequence space $(\mathbb{P}^{\mathcal{T}}, \mathcal{P}^{\mathcal{T}})$.

A map $\phi : A^{\mathcal{T}} \to \mathbb{P}^{\mathcal{T}}$ is said to be stationary if $\phi T_A = T_{\mathbb{P}} \phi$. Given a
sequence $P \in \mathbb{P}^{\mathcal{T}}$, let $\mathcal{M}(P)$ denote the set of all probability measures on
$(B^{\mathcal{T}}, \mathcal{B}_B^{\mathcal{T}})$ with respect to which $Y_m, Y_{m+1}, Y_{m+2}, \ldots$ forms a Markov chain with
transition matrices $P_m, P_{m+1}, \ldots$ for any integer $m$, that is, $\lambda \in \mathcal{M}(P)$ if and
only if for any $m$

$$\lambda[Y_m = y_m, \ldots, Y_n = y_n] = \lambda[Y_m = y_m] \prod_{i=m}^{n-1} P_i(y_i, y_{i+1}), \quad n > m,\ y_m, \ldots, y_n \in B.$$

In the one-sided case only $m = 1$ need be verified. Observe that in general the
Markov chain is nonhomogeneous.

A channel $[A, \nu, B]$ is said to be Markov if there exists a stationary
measurable map $\phi : A^{\mathcal{T}} \to \mathbb{P}^{\mathcal{T}}$ such that $\nu_x \in \mathcal{M}(\phi(x))$, $x \in A^{\mathcal{T}}$.

Markov channels were introduced by Kieffer and Rahe [86] who proved that 
one-sided and two-sided Markov channels are AMS. Their proof is not included 
as it is lengthy and involves techniques not otherwise used in this book. The 
channels are introduced for completeness and to show that several important 
channels and codes in the literature can be considered as special cases. A variety 
of conditions for ergodicity for Markov channels are considered in [60] . Most are 
equivalent to one already considered more generally here: A Markov channel is 
ergodic if it is output mixing. 

The most important special cases of Markov channels are finite state channels
and codes. Given a Markov channel with stationary mapping $\phi$, the channel
is said to be a finite state channel (FSC) if we have a collection of stochastic
matrices $P_a \in \mathbb{P}$; $a \in A$ and that $\phi(x)_n = P_{x_n}$, that is, the matrix produced by $\phi$
at time $n$ depends only on the input at that time, $x_n$. If the matrices $P_a$; $a \in A$
contain only 0’s and 1’s, the channel is called a finite state code. There are
several equivalent models of finite state channels and we pause to consider an
alternative form that is more common in information theory. (See Gallager [43],
Ch. 4, for a discussion of equivalent models of FSC’s and numerous physical
examples.) An FSC converts an input sequence $x$ into an output sequence $y$
and a state sequence $s$ according to a conditional probability

$$\Pr(Y_k = y_k, S_k = s_k;\ k = m, \ldots, n \mid X_i = x_i, S_i = s_i;\ i < m) = \prod_{i=m}^{n} P(y_i, s_i \mid x_i, s_{i-1}),$$

that is, conditioned on $X_i, S_{i-1}$, the pair $Y_i, S_i$ is independent of all prior inputs,
outputs, and states. This specifies an FSC defined as a special case of a Markov
channel where the output sequence above is here the joint state-output sequence
$\{y_i, s_i\}$. Note that with this setup, saying the Markov channel is AMS implies
that the triple process of source, states, and outputs is AMS (and hence
obviously so is the Gallager input-output process). We will adopt the Kieffer-Rahe
viewpoint and call the outputs $\{Y_n\}$ of the Markov channel states even though
they may correspond to state-output pairs for a specific physical model.
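
The product form of the FSC conditional probability translates directly into a
sampling loop. The Python sketch below is illustrative only: `simulate_fsc`, the
`kernel` interface, and the example `differential_code` are hypothetical names
and choices, with `kernel(x_i, s_prev)` standing in for $P(y_i, s_i \mid x_i, s_{i-1})$.

```python
import random


def simulate_fsc(x, s0, kernel, rng):
    """Sample outputs and states of a finite state channel.

    kernel(x_i, s_prev) returns a dict mapping (y_i, s_i) pairs to
    probabilities, playing the role of P(y_i, s_i | x_i, s_{i-1}).
    """
    ys, ss, s_prev = [], [], s0
    for xi in x:
        pairs, probs = zip(*kernel(xi, s_prev).items())
        yi, si = rng.choices(pairs, weights=probs, k=1)[0]
        ys.append(yi)
        ss.append(si)
        s_prev = si
    return ys, ss


def differential_code(xi, s_prev):
    """Hypothetical finite state code: all probabilities are 0 or 1."""
    s = xi ^ s_prev           # new state is the running parity of the input
    return {(s, s): 1.0}      # the output equals the new state


rng = random.Random(0)
ys, ss = simulate_fsc([1, 0, 1, 1, 0], s0=0, kernel=differential_code, rng=rng)
print(ys)
```

Since the example kernel is deterministic it is a finite state code in the sense
above; a kernel returning several pairs with fractional weights would give a
genuinely random finite state channel.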

In the two-sided case, the Markov channel is significantly more general than
the FSC because the choice of matrices $\phi(x)_i$ can depend on the past in a very
complicated (but stationary) way. One might think that a Markov channel is
not a significant generalization of an FSC in the one-sided case, however,
because there stationarity of $\phi$ does not permit a dependence on past channel
inputs, only on future inputs, which might seem physically unrealistic. Many
practical communications systems do effectively depend on the future, however, 
by incorporating delay in the coding. The prime example of such look-ahead 
coders are trellis and tree codes used in an incremental fashion. Such codes in- 
vestigate many possible output strings several steps into the future to determine 
the possible effect on the receiver and select the best path, often by a Viterbi 
algorithm. (See, e.g., Viterbi and Omura [145].) The encoder then outputs only 
the first symbol of the selected path. While clearly a finite state machine, this 
code does not fit the usual model of a finite state channel or code because of 
the dependence of the transition matrix on future inputs (unless, of course, one 
greatly expands the state space). It is, however, a Markov channel. 

Example 9.4.15: Cascade Channels 

We will often wish to connect more than one channel in cascade in order to
form a communication system, e.g., the original source is connected to a
deterministic channel (encoder) which is connected to a communications channel
which is in turn connected to another deterministic channel (decoder). We now
make precise this idea. Suppose that we are given two channels $[A, \nu^{(1)}, C]$ and
$[C, \nu^{(2)}, B]$. The cascade of $\nu^{(1)}$ and $\nu^{(2)}$ is defined as the channel $[A, \nu, B]$ given
by

$$\nu_x(F) = \int_{C^{\mathcal{T}}} \nu_u^{(2)}(F)\, d\nu_x^{(1)}(u).$$

In other words, if the original source sequence is $X$, the output of the first
channel and input to the second is $U$, and the output of the second channel is
$Y$, then $\nu_x^{(1)}(F) = P_{U|X}(F|x)$, $\nu_u^{(2)}(G) = P_{Y|U}(G|u)$, and $\nu_x(G) = P_{Y|X}(G|x)$.
Observe that by construction $X \to U \to Y$ is a Markov chain.
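
For memoryless channels with finite alphabets the cascade reduces, symbol by
symbol, to a product of transition matrices. The Python sketch below shows this
special case only; the functions `cascade` and `bsc` and the crossover
probabilities are assumptions made for illustration and do not cover the
general sequence-space definition.

```python
def cascade(p1, p2):
    """Compose two finite-alphabet transition matrices.

    p1[x][u] = P(U = u | X = x) and p2[u][y] = P(Y = y | U = u); the result
    has entries sum_u p1[x][u] * p2[u][y], the finite sum that replaces the
    integral over the intermediate alphabet.
    """
    return [
        [sum(p1[x][u] * p2[u][y] for u in range(len(p2)))
         for y in range(len(p2[0]))]
        for x in range(len(p1))
    ]


def bsc(eps):
    """Binary symmetric channel with crossover probability eps."""
    return [[1 - eps, eps], [eps, 1 - eps]]


p, q = 0.1, 0.2
nu = cascade(bsc(p), bsc(q))

# Two binary symmetric channels in cascade form another one whose crossover
# probability is p(1 - q) + q(1 - p).
expected = p * (1 - q) + q * (1 - p)
assert abs(nu[0][1] - expected) < 1e-12
assert all(abs(sum(row) - 1.0) < 1e-12 for row in nu)
```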

Lemma 9.4.7: A cascade of two stationary channels is stationary. 

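Proof: Let $T$ denote the shift on all of the spaces. Then

$$\nu_x(T^{-1}F) = \int_{C^{\mathcal{T}}} \nu_u^{(2)}(T^{-1}F)\, d\nu_x^{(1)}(u) = \int_{C^{\mathcal{T}}} \nu_u^{(2)}(F)\, d\nu_x^{(1)}T^{-1}(u),$$

where the second equality uses the stationarity of $\nu^{(2)}$ and a change of
variables. But $\nu_x^{(1)}(T^{-1}F) = \nu_{Tx}^{(1)}(F)$, that is, the measures $\nu_x^{(1)}T^{-1}$ and $\nu_{Tx}^{(1)}$ are
identical and hence the above integral is

$$\int_{C^{\mathcal{T}}} \nu_u^{(2)}(F)\, d\nu_{Tx}^{(1)}(u) = \nu_{Tx}(F),$$

proving the lemma. □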

Example 9.4.16: Communication System 

A communication system consists of a source $[A, \mu]$, a sequence encoder
$f : A^{\mathcal{T}} \to B^{\mathcal{T}}$ (a deterministic channel), a channel $[B, \nu, B']$, and a sequence
decoder $g : {B'}^{\mathcal{T}} \to \hat{A}^{\mathcal{T}}$. The overall distribution $r$ is specified by its values on
rectangles as

$$r(F_1 \times F_2 \times F_3 \times F_4) = \int_{F_1 \cap f^{-1}(F_2)} d\mu(x)\, \nu_{f(x)}(F_3 \cap g^{-1}(F_4)).$$

Denoting the source by $\{X_n\}$, the encoded source or channel input process by
$\{U_n\}$, the channel output process by $\{Y_n\}$, and the decoded process by $\{\hat{X}_n\}$,
then $r$ is the distribution of the process $\{X_n, U_n, Y_n, \hat{X}_n\}$. If we let $X, U, Y$, and
$\hat{X}$ denote the corresponding sequences, then observe that $X \to U \to Y$ and
$U \to Y \to \hat{X}$ are Markov chains. We abbreviate a communication system to
$[\mu, f, \nu, g]$.

It is straightforward from Lemma 9.4.7 to show that if the source, channel, 
and coders are stationary, then so is the overall process. 

The following is a basic property of a communication system: If the
communication system is stationary, then the mutual information rate between the
overall input and output cannot exceed that over the channel. The result
is often called the data processing theorem.

Lemma 9.4.8: Suppose that a communication system is stationary in the
sense that the process $\{X_n, U_n, Y_n, \hat{X}_n\}$ is stationary. Then

$$\tilde{I}(U; Y) \ge \bar{I}(X; Y) \ge \bar{I}(X; \hat{X}). \tag{9.10}$$

If $\{U_n\}$ has a finite alphabet or if it has the $K$-gap information property
(6.13) and $I(U^K, Y) < \infty$, then

$$\bar{I}(X; \hat{X}) \le \bar{I}(U; Y).$$

Proof: Since $\{\hat{X}_n\}$ is a stationary deterministic encoding of the $\{Y_n\}$,

$$\bar{I}(X; \hat{X}) \le I^*(X; Y).$$

From Theorem 6.4.1 the right hand side is bounded above by $\bar{I}(X; Y)$. For each
$n$

$$I(X^n; Y^n) \le I((X^n, U); Y^n) = I(Y^n; U) + I(X^n; Y^n | U) = I(Y^n; U),$$

where $U = \{U_n, n \in \mathcal{T}\}$ and we have used the fact that $X \to U \to Y$ is
a Markov chain and hence so is $X^n \to U \to Y^n$ and hence the conditional
mutual information is 0 (Lemma 5.5.2). Thus

$$\bar{I}(X; Y) \le \lim_{n\to\infty} \frac{1}{n} I(Y^n; U) = \tilde{I}(Y; U).$$

Applying Theorem 6.4.1 then proves that

$$\bar{I}(X; \hat{X}) \le \tilde{I}(Y; U).$$

If $\{U_n\}$ has finite alphabet or has the $K$-gap information property and $I(U^K, Y) <
\infty$, then from Theorems 6.4.1 or 6.4.3, respectively, $\tilde{I}(Y; U) = \bar{I}(Y; U)$,
completing the proof. □
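
The inequality has a familiar single-letter analogue that is easy to verify
numerically: for random variables forming a Markov chain $X \to U \to Y$, the
mutual information $I(X; Y)$ cannot exceed $I(U; Y)$. The Python sketch below checks
this for one finite example. It concerns ordinary mutual information rather
than the information rates of the lemma, and the distributions and function
names are hypothetical.

```python
from math import log2


def mutual_information(joint):
    """Mutual information in bits of a joint pmf given as a nested list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


def joint_from(prior, channel):
    """Joint pmf of (input, output) for an input pmf and a transition matrix."""
    return [[prior[a] * channel[a][b] for b in range(len(channel[a]))]
            for a in range(len(prior))]


def compose(p1, p2):
    """Transition matrix of the cascade of p1 followed by p2."""
    return [[sum(p1[a][u] * p2[u][b] for u in range(len(p2)))
             for b in range(len(p2[0]))] for a in range(len(p1))]


# Hypothetical chain X -> U -> Y.
p_x = [0.3, 0.7]
encoder = [[0.9, 0.1], [0.2, 0.8]]     # P(U | X)
channel = [[0.85, 0.15], [0.1, 0.9]]   # P(Y | U)

p_u = [sum(p_x[a] * encoder[a][u] for a in range(2)) for u in range(2)]
i_uy = mutual_information(joint_from(p_u, channel))
i_xy = mutual_information(joint_from(p_x, compose(encoder, channel)))

# Data processing: information about Y cannot increase from U back to X.
assert i_xy <= i_uy + 1e-12
```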

The lemma can be easily extended to block stationary processes. 

Corollary 9.4.4: Suppose that the process of the previous lemma is not
stationary, but is $(N, K)$-stationary in the sense that the vector process
$\{X_{nN}^N, U_{nK}^K, Y_{nK}^K, \hat{X}_{nN}^N\}$ is stationary. Then

$$\bar{I}(X; \hat{X}) \le \frac{K}{N} \tilde{I}(U; Y).$$

Proof: Apply the previous lemma to the stationary vector sequence to find
that

$$\bar{I}(X^N; \hat{X}^N) \le \tilde{I}(U^K; Y^K).$$

But

$$\bar{I}(X^N; \hat{X}^N) = \lim_{n\to\infty} \frac{1}{n} I(X^{nN}; \hat{X}^{nN})$$

which is the limit of the expectation of the information densities $n^{-1} i_{X^{nN}, \hat{X}^{nN}}$
which is $N$ times a subsequence of the densities $n^{-1} i_{X^n, \hat{X}^n}$, whose expectation
converges to $\bar{I}(X; \hat{X})$. Thus

$$\bar{I}(X^N; \hat{X}^N) = N \bar{I}(X; \hat{X}).$$

A similar manipulation for $\tilde{I}(U^K; Y^K)$ completes the proof. □


9.5 The Rohlin-Kakutani Theorem



The punctuation sequences of Section 9.4 provide a means for converting a block 
code into a sliding block code. Suppose, for example, that {X n } is a source 
with alphabet $A$ and $\gamma_N$ is a block code, $\gamma_N : A^N \to B^N$. (The dimensions
of the input and output vector are assumed equal to simplify the discussion.)
Typically $B$ is binary. As has been argued, block codes are not stationary.
One way to stationarize a block code is to use a procedure similar to that
used to stationarize a block memoryless channel: Send long sequences of blocks
with occasional random spacing to make the overall encoded process stationary.
Thus, for example, one could use a sliding block code to produce a punctuation
sequence $\{Z_n\}$ as in Corollary 9.4.2 which produces isolated 0’s followed by $KN$
1’s and occasionally produces 2’s. The sliding block code uses $\gamma_N$ to encode a
sequence of $K$ source blocks $X_n^N, X_{n+N}^N, \ldots, X_{n+(K-1)N}^N$ if and only if $Z_n = 0$.
For those rare times $l$ when $Z_l = 2$, the sliding block code produces an arbitrary
symbol $b^* \in B$. The resulting sliding block code inherits many of the properties
of the original block code, as will be demonstrated when proving theorems 
for sliding block codes constructed in this manner. In fact this construction 
suffices for source coding theorems, but an additional property will be needed 
when treating the channel coding theorems. The shortcoming of the results of 
Lemma 9.4.4 and Corollary 9.4.2 is that important source events can depend 
on the punctuation sequence. In other words, probabilities can be changed by 
conditioning on the occurrence of Z n = 0 or the beginning of a block code word. 
In this section we modify the simple construction of Lemma 9.4.4 to effectively 
obtain a new punctuation sequence that is approximately independent of certain 
prespecified events. The result is a variation of the Rohlin-Kakutani theorem 
of ergodic theory [127] [71]. The development here is patterned after that in 
Shields [131]. 

We begin by recasting the punctuation sequence result in different terms.
Given a stationary and ergodic source $\{X_n\}$ with a process distribution $\mu$ and
a punctuation sequence $\{Z_n\}$ as in Section 9.4, define the set $F = \{x : Z_N(x) = 0\}$,
where $x \in A^\infty$ is a two-sided sequence $x = (\cdots, x_{-1}, x_0, x_1, \cdots)$. Let $T$
denote the shift on this sequence space. Restating Corollary 9.4.2 yields the
following.

Lemma 9.5.1: Given $\delta > 0$ and an integer $N$, there exist an $L$ sufficiently large and a
set $F$ of sequences that is measurable with respect to $(X_{-L}, \ldots, X_L)$ with the
following properties:

(A) The sets $T^i F$, $i = 0, 1, \ldots, N-1$, are disjoint.

(B)
$$\frac{1-\delta}{N} \le \mu(F) \le \frac{1}{N}.$$

(C)
$$1 - \delta \le \mu\left(\bigcup_{i=0}^{N-1} T^i F\right).$$

So far all that has been done is to rephrase the punctuation result in more
ergodic theory oriented terminology. One can think of the lemma as
representing sequence space as a “base” $F$ together with its disjoint shifts $T^i F$; $i =
1, 2, \ldots, N-1$, which make up most of the space, together with whatever is left
over, a set $G = A^\infty - \bigcup_{i=0}^{N-1} T^i F$, a set which has probability at most $\delta$ and which will be
called the “garbage set.” This picture is called a tower. The basic construction
is pictured in Figure 9.1.

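[Figure: a vertical stack of levels, from bottom to top $F$, $TF$, $T^2F$, $T^3F$,
and so on up to the top level of the tower, with the garbage set $G$ drawn above the tower.]

Figure 9.1: Rohlin-Kakutani Tower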
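
Properties (A)-(C) can be seen in a toy system. The Python sketch below builds a
tower for the cyclic shift $T(x) = (x + 1) \bmod M$ on $M$ points with the uniform
measure. This is an assumed finite stand-in chosen only to make the picture
concrete (the names `tower_levels`, `M`, and `N` are hypothetical); it is not the
sequence-space construction of the lemma.

```python
from fractions import Fraction


def tower_levels(M, N):
    """Levels T^i F, i = 0..N-1, for the shift T(x) = (x + 1) mod M.

    The base F consists of the points 0, N, 2N, ... that leave room for a
    full column of height N below M.
    """
    base = list(range(0, M - N + 1, N))
    return [{(x + i) % M for x in base} for i in range(N)]


M, N = 103, 10
levels = tower_levels(M, N)
tower = set().union(*levels)
garbage = set(range(M)) - tower

# (A) the levels are pairwise disjoint
assert sum(len(level) for level in levels) == len(tower)

mu_F = Fraction(len(levels[0]), M)        # uniform measure of the base
mu_garbage = Fraction(len(garbage), M)    # plays the role of delta

# (B) and (C) with delta taken equal to the garbage probability
assert (1 - mu_garbage) / N <= mu_F <= Fraction(1, N)
assert Fraction(len(tower), M) >= 1 - mu_garbage
```

Here the garbage set consists of the last three points, so the tower covers all
but a fraction 3/103 of the space.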

Next consider a partition $\mathcal{P} = \{P_i;\ i = 0, 1, \ldots, \|\mathcal{P}\| - 1\}$ of $A^\infty$. One
example would be the partition of a finite alphabet sequence space into its
possible outputs at time 0, that is, $P_i = \{x : x_0 = a_i\}$ for $i = 0, 1, \ldots, \|A\| - 1$.
Another partition would be according to the output of a sliding block coding of
$x$. The most important example, however, will be when there is a finite collection
of important events that we wish to force to be approximately independent of
the punctuation sequence and $\mathcal{P}$ is chosen so that the important events are
unions of atoms of $\mathcal{P}$.

We now can state the main result of this section. 

Lemma 9.5.2: Given the assumptions of Lemma 9.5.1, $L$ and $F$ can be
chosen so that in addition to properties (A)-(C) it is also true that

(D)
$$\mu(P_i | F) = \mu(P_i | T^l F); \quad l = 1, 2, \ldots, N-1, \tag{9.11}$$

$$\mu(P_i | F) = \mu\left(P_i \,\Big|\, \bigcup_{k=0}^{N-1} T^k F\right), \tag{9.12}$$

$$\mu(P_i \cap F) \le \frac{1}{N} \mu(P_i). \tag{9.13}$$

Comment: Eq. (9.13) can be interpreted as stating that $P_i$ and $F$ are
approximately independent since $1/N$ is approximately the probability of $F$.
Only the upper bound is stated as it is all we need. Eq. (9.11) also implies that
$\mu(P_i \cap F)$ is bounded below by $(\mu(P_i) - \delta)\mu(F)$.

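Proof: Eq. (9.12) follows from (9.11) since

$$\mu\left(P_i \,\Big|\, \bigcup_{l=0}^{N-1} T^l F\right) = \frac{\mu\left(P_i \cap \bigcup_{l=0}^{N-1} T^l F\right)}{\mu\left(\bigcup_{l=0}^{N-1} T^l F\right)} = \frac{\sum_{l=0}^{N-1} \mu(P_i \cap T^l F)}{\sum_{l=0}^{N-1} \mu(T^l F)}$$

$$= \frac{\sum_{l=0}^{N-1} \mu(P_i | T^l F)\, \mu(T^l F)}{N \mu(F)} = \frac{1}{N} \sum_{l=0}^{N-1} \mu(P_i | T^l F) = \mu(P_i | F).$$

Eq. (9.13) follows from (9.12) since

$$\mu(P_i \cap F) = \mu(P_i | F)\, \mu(F) = \mu\left(P_i \,\Big|\, \bigcup_{k=0}^{N-1} T^k F\right) \mu(F) = \mu\left(P_i \,\Big|\, \bigcup_{k=0}^{N-1} T^k F\right) \frac{1}{N}\, \mu\left(\bigcup_{k=0}^{N-1} T^k F\right),$$

since the $T^k F$ are disjoint and have equal probability,

$$= \frac{1}{N}\, \mu\left(P_i \cap \bigcup_{k=0}^{N-1} T^k F\right) \le \frac{1}{N} \mu(P_i).$$

The remainder of this section is devoted to proving (9.11). We begin by
reviewing and developing some needed notation.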

Given a partition $\mathcal{P}$, we define the label function

$$\mathrm{label}_{\mathcal{P}}(x) = \sum_{i=0}^{\|\mathcal{P}\|-1} i\, 1_{P_i}(x),$$

where as usual $1_P$ is the indicator function of a set $P$. Thus the label of a
sequence is simply the index of the atom of the partition into which it falls.
As $\mathcal{P}$ partitions the input space into which sequences belong to atoms of $\mathcal{P}$,
$T^{-i}\mathcal{P}$ partitions the space according to which shifted sequences $T^i x$ belong to
atoms of $\mathcal{P}$, that is, $x \in T^{-i}P_l \in T^{-i}\mathcal{P}$ is equivalent to $T^i x \in P_l$ and hence
$\mathrm{label}_{\mathcal{P}}(T^i x) = l$. The join

$$\mathcal{P}^N = \bigvee_{i=0}^{N-1} T^{-i}\mathcal{P}$$

partitions the space into sequences sharing $N$ labels in the following sense: Each
atom $Q$ of $\mathcal{P}^N$ has the form

$$Q = \{x : \mathrm{label}_{\mathcal{P}}(x) = k_0,\ \mathrm{label}_{\mathcal{P}}(Tx) = k_1,\ \ldots,\ \mathrm{label}_{\mathcal{P}}(T^{N-1}x) = k_{N-1}\}$$

for some $N$-tuple of integers $k = (k_0, \ldots, k_{N-1})$. For this reason we will index
the atoms of $\mathcal{P}^N$ as $Q_k$. Thus $\mathcal{P}^N$ breaks up the sequence space into groups of
sequences which have the same labels for $N$ shifts.
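
The label function and the indexing of atoms of the join by $N$-tuples of labels
can be made concrete for the time-zero partition. In the Python sketch below a
finite tuple stands in for a sequence and dropping the first coordinate stands
in for the shift; these simplifications and all names are assumptions made for
illustration.

```python
def label(x, atoms):
    """Index of the atom containing x; atoms is a list of predicates.

    Mirrors the sum over i of i times the indicator of the i-th atom.
    """
    return sum(i for i, in_atom in enumerate(atoms) if in_atom(x))


def shift(x):
    """Left shift T on a finite window standing in for a sequence."""
    return x[1:]


def n_labels(x, atoms, N):
    """Labels of x, Tx, ..., T^(N-1)x: the index k of the atom Q_k of the join."""
    labels = []
    for _ in range(N):
        labels.append(label(x, atoms))
        x = shift(x)
    return tuple(labels)


alphabet = ["a", "b", "c"]
# Time-zero partition: the i-th atom is {x : x_0 = a_i}.
atoms = [lambda x, a=a: x[0] == a for a in alphabet]

x = ("b", "a", "c", "c", "a", "b")
assert n_labels(x, atoms, 4) == (1, 0, 2, 2)
```

For this partition the tuple of labels simply records the first $N$ symbols of
the sequence, so two sequences fall in the same atom of the join exactly when
they agree on those coordinates.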

We first construct using Lemma 9.5.1 a huge tower of size $KN \gg N$, the
height of the tower to be produced for this lemma. Let $S$ denote the base of
this original tower and let $\epsilon$ be the probability of the garbage set. This height
$KN$ tower with base $S$ will be used to construct a new tower of height $N$ and
a base $F$ with the additional desired property. First consider the restriction of
the partition $\mathcal{P}^{KN}$ to $S$ defined by $\mathcal{P}^{KN} \cap S = \{Q_k \cap S$; all $KN$-tuples $k$ with
coordinates taking values in $\{0, 1, \ldots, \|\mathcal{P}\| - 1\}\}$. $\mathcal{P}^{KN} \cap S$ divides up the original
base according to the labels of $NK$ shifts of base sequences. For each atom
$Q_k \cap S$ in this base partition, the sets $\{T^l(Q_k \cap S);\ l = 0, 1, \ldots, KN - 1\}$ are
disjoint and together form a column of the tower $\{T^l S;\ l = 0, 1, \ldots, KN - 1\}$.
A set of the form $T^l(Q_k \cap S)$ is called the $l$th level of the column containing it.
Observe that if $y \in T^l(Q_k \cap S)$, then $y = T^l u$ for some $u \in Q_k \cap S$ and $T^l u$ has
label $k_l$. Thus we consider $k_l$ to be the label of the column level $T^l(Q_k \cap S)$.
This complicated structure of columns and levels can be used to recover the
original partition by

$$P_j = \bigcup_{l,k:\, k_l = j} T^l(Q_k \cap S) \cup (P_j \cap G), \tag{9.14}$$

that is, $P_j$ is the union of all column levels with label $j$ together with that part
of $P_j$ in the garbage. We will focus on the pieces of $P_j$ in the column levels as
the garbage has very small probability.

We wish to construct a new tower with base $F$ so that the probability of $P_i$
for any of $N$ shifts of $F$ is the same. To do this we form $F$ by dividing each column
of the original tower into $N$ equal parts. We collect a group of these parts to
form $F$ so that $F$ will contain only one part at each level, the $N$ shifts of $F$ will
be disjoint, and the union of the $N$ shifts will almost contain all of the original
tower. By using the equal probability parts the new base will have conditional
probabilities for $P_j$ given $T^l F$ equal for all $l$, as will be shown.

Consider the atom $Q = Q_k \cap S$ in the partition $\mathcal{P}^{KN} \cap S$ of the base of the
original tower. If the source is aperiodic in the sense of placing zero probability
on individual sequences, then the set $Q$ can be divided into $N$ disjoint sets of
equal probability, say $W_0, W_1, \ldots, W_{N-1}$. Define the set $F_Q$ by

$$F_Q = \left(\bigcup_{i=0}^{K-2} T^{iN} W_0\right) \cup \left(\bigcup_{i=0}^{K-2} T^{1+iN} W_1\right) \cup \cdots \cup \left(\bigcup_{i=0}^{K-2} T^{N-1+iN} W_{N-1}\right) = \bigcup_{l=0}^{N-1} \bigcup_{i=0}^{K-2} T^{l+iN} W_l.$$

$F_Q$ contains $(K - 2)$ $N$-shifts of $W_0$, of $TW_1$, $\ldots$, of $T^l W_l$, $\ldots$, and of $T^{N-1} W_{N-1}$.
Because it only takes $N$-shifts of each small set and because it does not include
the top $N$ levels of the original column, shifting $F_Q$ fewer than $N$ times causes
no overlap, that is, $T^j F_Q$ are disjoint for $j = 0, 1, \ldots, N-1$. The union of these
sets contains all of the original column of the tower except possibly portions of
the top and bottom $N - 1$ levels (which the construction may not include). The
new base $F$ is now defined to be the union of all of the $F_{Q_k \cap S}$. The sets $T^l F$
are then disjoint (since all the pieces are) and contain all of the levels of the
original tower except possibly the top and bottom $N - 1$ levels. Thus

$$\mu\left(\bigcup_{l=0}^{N-1} T^l F\right) \ge \mu\left(\bigcup_{i=N}^{(K-1)N-1} T^i S\right) = \sum_{i=N}^{(K-1)N-1} \mu(S) = (K-2)N\mu(S) \ge \frac{(K-2)N(1-\epsilon)}{KN} = (1-\epsilon)\left(1 - \frac{2}{K}\right);$$

by choosing $\epsilon = \delta/2$ and $K$ large this can be made larger than $1 - \delta$. Thus the
new tower satisfies conditions (A)-(C) and we need only verify the new condition
(D), that is, (9.11). We have that

$$\mu(P_i | T^l F) = \frac{\mu(P_i \cap T^l F)}{\mu(F)}.$$

Since the denominator does not depend on $l$, we need only show the numerator
does not depend on $l$. From (9.14) applied to the original tower we have that

$$\mu(P_i \cap T^l F) = \sum_{j,k:\, k_j = i} \mu(T^j(Q_k \cap S) \cap T^l F),$$

that is, the sum over all column levels (old tower) labeled $i$ of the probability
of the intersection of the column level and the $l$th shift of the new base $F$. The
intersection of a column level in the $j$th level of the original tower with any shift
of $F$ must be an intersection of that column level with the $j$th shift of one of
the sets $W_0, \ldots, W_{N-1}$ (which particular set depends on $l$). Whichever set is
chosen, however, the probability within the sum has the form

$$\mu(T^j(Q_k \cap S) \cap T^l F) = \mu(T^j(Q_k \cap S) \cap T^j W_m) = \mu((Q_k \cap S) \cap W_m) = \mu(W_m),$$

where the final step follows since $W_m$ was originally chosen as a subset of $Q_k \cap S$.
Since these subsets were all chosen to have equal probability, this last probability
does not depend on $m$ and hence on $l$ and

$$\mu(T^j(Q_k \cap S) \cap T^l F) = \frac{1}{N} \mu(Q_k \cap S)$$

and hence

$$\mu(P_i \cap T^l F) = \sum_{j,k:\, k_j = i} \frac{1}{N} \mu(Q_k \cap S),$$

which proves (9.11) since there is no dependence on $l$. This completes the proof
of the lemma. □




Chapter 10 



Distortion 



10.1 Introduction 

We now turn to quantification of various notions of the distortion between ran- 
dom variables, vectors and processes. A distortion measure is not a “measure” 
in the sense used so far; it is an assignment of a nonnegative real number which 
indicates how bad an approximation one symbol or random object is of another; 
the smaller the distortion, the better the approximation. If the two objects cor- 
respond to the input and output of a communication system, then the distortion 
provides a measure of the performance of the system. Distortion measures need 
not have metric properties such as the triangle inequality and symmetry, but 
such properties can be exploited when available. We shall encounter several 
notions of distortion and a diversity of applications, with eventually the most 
important application being a measure of the performance of a communica- 
tions system by an average distortion between the input and output. Other 
applications include extensions of finite memory channels to channels which ap- 
proximate finite memory channels and different characterizations of the optimal 
performance of communications systems. 



10.2 Distortion and Fidelity Criteria 

Given two measurable spaces $(A, \mathcal{B}_A)$ and $(B, \mathcal{B}_B)$, a distortion measure on
$A \times B$ is a nonnegative measurable mapping $\rho : A \times B \to [0, \infty)$ which assigns
a real number $\rho(x, y)$ to each $x \in A$ and $y \in B$ which can be thought of as the
cost of reproducing $x$ as $y$. The principal practical goal is to have a number by
which the goodness or badness of communication systems can be compared. For
example, if the input to a communication system is a random variable $X \in A$
and the output is $Y \in B$, then one possible measure of the quality of the system
is the average distortion $E\rho(X, Y)$. Ideally one would like a distortion measure
to have three properties:

• It should be tractable so that one can do useful theory.

• It should be computable so that it can be measured in real systems. 

• It should be subjectively meaningful in the sense that small (large) dis- 
tortion corresponds to good (bad) perceived quality. 

Unfortunately these requirements are often inconsistent and one is forced 
to compromise between tractability and subjective significance in the choice of 
distortion measures. Among the most popular choices for distortion measures 
are metrics or distances, but many practically important distortion measures 
are not metrics, e.g., they are not symmetric in their arguments or they do not 
satisfy a triangle inequality. An example of a metric distortion measure that 
will often be emphasized is that given when the input space A is a Polish space, 
a complete separable metric space under a metric $\rho$, and $B$ is either $A$ itself
or a Borel subset of A. In this case the distortion measure is fundamental to 
the structure of the alphabet and the alphabets are standard since the space is 
Polish. 

Suppose next that we have a sequence of product spaces $A^n$ and $B^n$ for
$n = 1, 2, \ldots$. A fidelity criterion $\rho_n$, $n = 1, 2, \ldots$ is a sequence of distortion
measures on $A^n \times B^n$. If one has a pair random process, say $\{X_n, Y_n\}$, then it
will be of interest to find conditions under which there is a limiting per symbol
distortion in the sense that

$$\rho_\infty(x, y) = \lim_{n \to \infty} \frac{1}{n} \rho_n(x^n, y^n)$$

exists. As one might guess, the distortion measures in the sequence often are 
interrelated. The simplest and most common example is that of an additive or 
single-letter fidelity criterion which has the form 

$$\rho_n(x^n, y^n) = \sum_{i=0}^{n-1} \rho_1(x_i, y_i).$$
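
As a small illustration of the additive form above, the following Python sketch evaluates $\rho_n$ and the per-symbol distortion $n^{-1}\rho_n$ for two finite sequences. The function names and the two per-letter distortions (Hamming and squared error) are illustrative choices, not notation from the text.

```python
from typing import Callable, Sequence


def hamming(x, y) -> float:
    """Per-letter Hamming distortion: 0 if the letters agree, 1 otherwise."""
    return 0.0 if x == y else 1.0


def squared_error(x: float, y: float) -> float:
    """Per-letter squared-error distortion (not a metric)."""
    return (x - y) ** 2


def additive_distortion(xs: Sequence, ys: Sequence,
                        rho1: Callable[[object, object], float]) -> float:
    """rho_n(x^n, y^n): the sum of the per-letter distortions rho_1(x_i, y_i)."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    return sum(rho1(x, y) for x, y in zip(xs, ys))


def per_symbol_distortion(xs: Sequence, ys: Sequence,
                          rho1: Callable[[object, object], float]) -> float:
    """n^{-1} rho_n(x^n, y^n), the quantity whose limit defines rho_infinity."""
    return additive_distortion(xs, ys, rho1) / len(xs)
```

For example, `per_symbol_distortion("0110", "0011", hamming)` counts the fraction of positions in which the two strings disagree.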

Here if the pair process is AMS, then the limiting distortion will exist and 
it is invariant from the ergodic theorem. By far the bulk of the information 
theory literature considers only single-letter fidelity criteria and we will share 
this emphasis. We will point out, however, other examples where the basic 
methods and results apply. For example, if $\rho_n$ is subadditive in the sense that

$$\rho_n(x^n, y^n) \le \rho_k(x^k, y^k) + \rho_{n-k}(x_k^{n-k}, y_k^{n-k}),$$

then stationarity of the pair process will ensure that $n^{-1}\rho_n$ converges from the
subadditive ergodic theorem. For example, if $d$ is a distortion measure on $A \times B$,
then

$$\rho_n(x^n, y^n) = \left( \sum_{i=0}^{n-1} d(x_i, y_i)^p \right)^{1/p}$$

for $p > 1$ is subadditive from Minkowski's inequality.
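
The subadditivity claim can be checked numerically. The sketch below is only an illustration under assumed names (`lp_distortion`, `is_subadditive`): it evaluates the $p$-th power criterion on a pair of sequences and tests the inequality $\rho_n \le \rho_k + \rho_{n-k}$ for every split point $k$. For an exponent below 1 the inequality can fail, which is why the restriction on $p$ matters.

```python
def lp_distortion(xs, ys, d, p):
    """rho_n(x^n, y^n) = (sum_i d(x_i, y_i)^p)^(1/p)."""
    return sum(d(x, y) ** p for x, y in zip(xs, ys)) ** (1.0 / p)


def is_subadditive(xs, ys, d, p, tol=1e-9):
    """Check rho_n <= rho_k + rho_{n-k} for every split 0 < k < n."""
    n = len(xs)
    whole = lp_distortion(xs, ys, d, p)
    for k in range(1, n):
        head = lp_distortion(xs[:k], ys[:k], d, p)
        tail = lp_distortion(xs[k:], ys[k:], d, p)
        if whole > head + tail + tol:
            return False
    return True


def abs_diff(x, y):
    """Per-letter distortion d(x, y) = |x - y| on the real line."""
    return abs(x - y)
```

With `abs_diff` and `p = 2` the check succeeds on any pair of real sequences, whereas `is_subadditive([0, 0], [1, 1], abs_diff, 0.5)` returns `False`.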

As an even simpler example, if $d$ is a distortion measure on $A \times B$, then the
following fidelity criterion converges for AMS pair processes:

$$\frac{1}{n} \rho_n(x^n, y^n) = \frac{1}{n} \sum_{i=0}^{n-1} f(d(x_i, y_i)).$$

This form often arises in the literature with $d$ being a metric and $f$ being a
nonnegative nondecreasing function (sometimes assumed convex).

The fidelity criteria introduced here all are context-free in that the distortion 
between n successive input/output samples of a pair process does not depend 
on samples occurring before or after these n samples. Some work has been
done on context-dependent distortion measures (see, e.g., [93]), but we do not 
consider their importance sufficient to merit the increased notational and tech- 
nical difficulties involved. Hence we shall consider only context-free distortion 
measures. 



10.3 Performance 

As a first application of the notion of distortion, we define a performance
measure of a communication system. Suppose that we have a communication system
$[\mu, f, \nu, g]$ such that the overall input/output process is $\{X_n, \hat{X}_n\}$. For the
moment let $p$ denote the corresponding distribution. Then one measure of the
quality (or rather the lack thereof) of the communication system is the long
term time average distortion per symbol between the input and output as
determined by the fidelity criterion. Given two sequences $x$ and $\hat{x}$ and a fidelity
criterion $\rho_n$; $n = 1, 2, \ldots$, define the limiting sample average distortion or
sequence distortion by

$$\rho_\infty(x, \hat{x}) = \limsup_{n \to \infty} \frac{1}{n} \rho_n(x^n, \hat{x}^n).$$

Define the performance of a communication system by the expected value of the
limiting sample average distortion:

$$\Delta(\mu, f, \nu, g) = E_p \rho_\infty = E_p \left( \limsup_{n \to \infty} \frac{1}{n} \rho_n(X^n, \hat{X}^n) \right). \tag{10.1}$$



We will focus on two important special cases. The first is that of AMS sys- 
tems and additive fidelity criteria. A large majority of the information theory 
literature is devoted to additive distortion measures and this bias is reflected 
here. We also consider the case of subadditive distortion measures and systems 
that are either two-sided and AMS or are one-sided and stationary. Unhappily 
the overall AMS one-sided case cannot be handled as there is not yet a
subadditive ergodic theorem for that case. In all of these cases we have that if $\rho_1$ is
integrable with respect to the stationary mean process $\bar{p}$, then

$$\rho_\infty(x, y) = \lim_{n \to \infty} \frac{1}{n} \rho_n(x^n, y^n); \quad p\text{-a.e.}, \tag{10.2}$$

and $\rho_\infty$ is an invariant function of its two arguments, i.e.,

$$\rho_\infty(T_A x, T_{\hat{A}} y) = \rho_\infty(x, y); \quad p\text{-a.e.} \tag{10.3}$$

When a system and fidelity criterion are such that (10.2) and (10.3) are 
satisfied we say that we have a convergent fidelity criterion. We henceforth 
make this assumption. 

Since $\rho_\infty$ is invariant, we have from Lemma 6.3.1 of [50] that

$$\Delta = E_p \rho_\infty = E_{\bar{p}} \rho_\infty. \tag{10.4}$$

If the fidelity criterion is additive, then we have from the stationarity of $\bar{p}$
that the performance is given by

$$\Delta = E_{\bar{p}} \rho_1(X_0, Y_0). \tag{10.5}$$

If the fidelity criterion is subadditive, then this is replaced by

$$\Delta = \inf_N \frac{1}{N} E_{\bar{p}} \rho_N(X^N, Y^N). \tag{10.6}$$
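
For the additive case (10.5), a simulation makes the identity concrete. The sketch below assumes a deliberately simple system that is not taken from the text: a memoryless binary source, identity coders, a memoryless binary symmetric channel with crossover probability `crossover`, and the Hamming per-letter distortion. The pair process is then i.i.d. (hence stationary), so the time average distortion should be close to $E\rho_1(X_0, Y_0)$, which equals the crossover probability.

```python
import random


def simulate_bsc_distortion(n: int, source_p: float, crossover: float,
                            seed: int = 0) -> float:
    """Time-average Hamming distortion over n uses of a binary symmetric channel.

    source_p  : probability that a source letter equals 1 (assumed i.i.d. source)
    crossover : probability that the channel flips a letter (assumed memoryless)
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        x = 1 if rng.random() < source_p else 0
        y = x ^ 1 if rng.random() < crossover else x
        errors += (x != y)
    return errors / n
```

Under these assumptions the sample average and the single-letter expectation agree up to sampling fluctuations, whatever the value of `source_p`.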

Assume for the remainder of this section that $\rho_n$ is an additive fidelity
criterion. Suppose now that $p$ is $N$-stationary; that is, if $T = T_A \times T_{\hat{A}}$
denotes the shift on the input/output space $A^T \times \hat{A}^T$, then the overall process
is stationary with respect to $T^N$. In this case

$$\Delta = \frac{1}{N} E_p \rho_N(X^N, \hat{X}^N). \tag{10.7}$$

We will have this $N$-stationarity, for example, if the source and channel are
stationary and the coders are $N$-stationary, e.g., are length $N$ block codes. More
generally, the source could be $N$-stationary, the first sequence coder
$(N, K)$-stationary, the channel $K$-stationary (e.g., stationary), and the second sequence
coder $(K, N)$-stationary.

We can also consider the behavior of the $N$-shift more generally when the
system is only AMS. This will be useful when considering block codes. Suppose
now that $p$ is AMS with stationary mean $\bar{p}$. Then from Theorem 7.3.1 of [50],
$p$ is also $T^N$-AMS with an $N$-stationary mean, say $\bar{p}_N$. Applying the ergodic
theorem to the $N$-shift then implies that if $\rho_N$ is $\bar{p}_N$-integrable, then

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \rho_N(x_{iN}^N, y_{iN}^N) = \rho_\infty^{(N)}(x, y) \tag{10.8}$$

exists $\bar{p}_N$ (and hence also $p$) almost everywhere. In addition, $\rho_\infty^{(N)}$ is $N$-invariant
and

$$E_p \rho_\infty^{(N)} = E_{\bar{p}_N} \rho_\infty^{(N)} = E_{\bar{p}_N} \rho_N(X^N, Y^N). \tag{10.9}$$

Comparison of (10.2) and (10.9) shows that $\rho_\infty^{(N)} = N \rho_\infty$, $p$-a.e. and hence

$$\Delta = \frac{1}{N} E_{\bar{p}_N} \rho_N(X^N, Y^N) = \frac{1}{N} E_p \rho_\infty^{(N)} = E_{\bar{p}} \rho_1(X_0, Y_0). \tag{10.10}$$

Given a notion of the performance of a communication system, we can now
define the optimal performance achievable for trying to communicate a given
source $\{X_n\}$ with distribution $\mu$ over a channel $\nu$. Suppose that $\mathcal{E}$ is some class
of sequence coders $f : A^T \to B^T$. For example, $\mathcal{E}$ might consist of all sequence
coders generated by block codes with some constraint or by finite-length sliding
block codes. Similarly let $\mathcal{D}$ denote a class of sequence coders $g : \hat{B}^T \to \hat{A}^T$.
Define the optimal performance theoretically achievable or OPTA function for
the source $\mu$, channel $\nu$, and code classes $\mathcal{E}$ and $\mathcal{D}$ by

$$\Delta^*(\mu, \nu, \mathcal{E}, \mathcal{D}) = \inf_{f \in \mathcal{E},\, g \in \mathcal{D}} \Delta([\mu, f, \nu, g]). \tag{10.11}$$

The goal of the coding theorems of information theory is to relate the OPTA 
function to (hopefully) computable functions of the source and channel. 

10.4 The rho-bar distortion 

In the previous sections it was pointed out that if one has a distortion measure
$\rho$ on two random objects $X$ and $Y$ and a joint distribution on the two random
objects (and hence also marginal distributions for each), then a natural notion of
the difference between the processes or the poorness of their mutual
approximation is the expected distortion $E\rho(X, Y)$. We now consider a different question:
What if one does not have a joint probabilistic description of $X$ and $Y$, but
instead knows only their marginal distributions? What then is a natural
notion of the distortion or poorness of approximation of the two random objects?
In other words, we previously measured the distortion between two random
variables whose stochastic connection was determined, possibly by a channel, a
code, or a communication system. We now wish to find a similar quantity for
the case when the two random objects are only described as individuals. One
possible definition is to find the smallest possible distortion in the old sense
consistent with the given information, that is, to minimize $E\rho(X, Y)$ over all
joint distributions consistent with the given marginal distributions. Note that
this will necessarily give a lower bound to the distortion achievable when any
specific joint distribution is specified.

To be precise, suppose that we have random variables $X$ and $Y$ with
distributions $P_X$ and $P_Y$ and alphabets $A$ and $B$, respectively. Let $\rho$ be a distortion
measure on $A \times B$. Define the $\bar{\rho}$-distortion (pronounced $\rho$-bar) between the
random variables $X$ and $Y$ by

$$\bar{\rho}(P_X, P_Y) = \inf_{p \in \mathcal{P}} E_p \rho(X, Y),$$

where $\mathcal{P} = \mathcal{P}(P_X, P_Y)$ is the collection of all measures on $(A \times B, \mathcal{B}_A \times \mathcal{B}_B)$
with $P_X$ and $P_Y$ as marginals; that is,

$$p(A \times F) = P_Y(F); \quad F \in \mathcal{B}_B,$$

and

$$p(G \times B) = P_X(G); \quad G \in \mathcal{B}_A.$$

Note that $\mathcal{P}$ is not empty since, for example, it contains the product measure
$P_X \times P_Y$.

Levenshtein [94] and Vasershtein [144] studied this quantity for the special
case where $A$ and $B$ are the real line and $\rho$ is the Euclidean distance. When as
in their case the distortion is a metric or distance, the $\bar{\rho}$-distortion is called the
$\bar{\rho}$-distance. Ornstein [116] developed the distance and many of its properties
for the special case where $A$ and $B$ were common discrete spaces and $\rho$ was the
Hamming distance. In this case the $\bar{\rho}$-distance is called the $\bar{d}$-distance. R. L.
Dobrushin has suggested that because of the common suffix in the names of its
originators, this distance between distributions should be called the shtein or
stein distance.
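
For a finite alphabet and the Hamming distortion, the infimum over joint distributions can be written down explicitly. The sketch below relies on a standard fact that is not stated in the text: the smallest achievable $\Pr(X \ne Y)$ over all couplings equals $1 - \sum_a \min(P_X(a), P_Y(a))$. The function names and the dictionary representation of a pmf are assumptions made for the example. The code also builds one coupling that attains this value, so its marginals and its expected distortion can be checked directly.

```python
def dbar_hamming(p: dict, q: dict) -> float:
    """First-order d-bar distance between pmfs p and q (dicts symbol -> prob)."""
    symbols = set(p) | set(q)
    return 1.0 - sum(min(p.get(a, 0.0), q.get(a, 0.0)) for a in symbols)


def maximal_coupling(p: dict, q: dict) -> dict:
    """A joint pmf with marginals p and q that minimizes Pr(X != Y).

    Mass min(p(a), q(a)) is placed on the diagonal; the leftover mass is
    spread off the diagonal in proportion to the residuals of p and q.
    """
    symbols = sorted(set(p) | set(q))
    common = {a: min(p.get(a, 0.0), q.get(a, 0.0)) for a in symbols}
    rest = 1.0 - sum(common.values())
    joint = {(a, a): m for a, m in common.items() if m > 0.0}
    if rest > 1e-15:
        for a in symbols:
            ra = p.get(a, 0.0) - common[a]
            for b in symbols:
                rb = q.get(b, 0.0) - common[b]
                if ra > 0.0 and rb > 0.0:
                    joint[(a, b)] = joint.get((a, b), 0.0) + ra * rb / rest
    return joint


def expected_hamming(joint: dict) -> float:
    """E rho(X, Y) for the Hamming distortion under a joint pmf."""
    return sum(mass for (a, b), mass in joint.items() if a != b)
```

The product measure is also a member of the class of couplings, but it generally gives a larger expected distortion than the coupling constructed here.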

The $\bar{\rho}$-distortion can be extended to processes in a natural way. Suppose
now that $\{X_n\}$ is a process with process distribution $m_X$ and that $\{Y_n\}$ is a
process with process distribution $m_Y$. Let $P_{X^n}$ and $P_{Y^n}$ denote the induced
finite dimensional distributions. A fidelity criterion provides the distortion $\rho_n$
between these $n$ dimensional alphabets. Let $\bar{\rho}_n$ denote the corresponding $\bar{\rho}$
distortion between the $n$ dimensional distributions. Then

$$\bar{\rho}(m_X, m_Y) = \sup_n \frac{1}{n} \bar{\rho}_n(P_{X^n}, P_{Y^n});$$

that is, the $\bar{\rho}$-distortion between two processes is the supremum of the $\bar{\rho}$-distortions
per symbol between $n$-tuples drawn from the process. The properties of the $\bar{\rho}$
distance are developed in [57] [119] and a detailed development may be found
in [50]. The following theorem summarizes the principal properties.

Theorem 10.4.1: Suppose that we are given an additive fidelity criterion $\rho_n$
with a pseudo-metric per-letter distortion $\rho_1$ and suppose that both distributions
$m_X$ and $m_Y$ are stationary and have the same standard alphabet. Then

(a) $\lim_{n \to \infty} n^{-1} \bar{\rho}_n(P_{X^n}, P_{Y^n})$ exists and equals $\sup_n n^{-1} \bar{\rho}_n(P_{X^n}, P_{Y^n})$.

(b) $\bar{\rho}_n$ and $\bar{\rho}$ are pseudo-metrics. If $\rho_1$ is a metric, then $\bar{\rho}_n$ and $\bar{\rho}$ are metrics.

(c) If $m_X$ and $m_Y$ are both i.i.d., then $\bar{\rho}(m_X, m_Y) = \bar{\rho}_1(P_{X_0}, P_{Y_0})$.

(d) Let $\mathcal{P}_s = \mathcal{P}_s(m_X, m_Y)$ denote the collection of all stationary distributions
$p_{XY}$ having $m_X$ and $m_Y$ as marginals, that is, distributions on $\{X_n, Y_n\}$
with coordinate processes $\{X_n\}$ and $\{Y_n\}$ having the given distributions.
Define the process distortion measure $\bar{\rho}'$

$$\bar{\rho}'(m_X, m_Y) = \inf_{p_{XY} \in \mathcal{P}_s} E_{p_{XY}} \rho(X_0, Y_0).$$

Then

$$\bar{\rho}(m_X, m_Y) = \bar{\rho}'(m_X, m_Y);$$

that is, the limit of the finite dimensional minimizations is given by a
minimization over stationary processes.

(e) Suppose that $m_X$ and $m_Y$ are both stationary and ergodic. Define $\mathcal{P}_e =
\mathcal{P}_e(m_X, m_Y)$ as the subset of $\mathcal{P}_s$ containing only ergodic processes, then

$$\bar{\rho}(m_X, m_Y) = \inf_{p_{XY} \in \mathcal{P}_e} E_{p_{XY}} \rho(X_0, Y_0).$$

(f) Suppose that $m_X$ and $m_Y$ are both stationary and ergodic. Let $G_X$ denote
a collection of generic sequences for $m_X$ in the sense of Section 8.3 of [50].
Generic sequences are those along which the relative frequencies of a set of
generating events all converge and hence by measuring relative frequencies
on generic sequences one can deduce the underlying stationary and ergodic
measure that produced the sequence. An AMS process produces generic
sequences with probability 1. Similarly let $G_Y$ denote a set of generic
sequences for $m_Y$. Define the process distortion measure

$$\bar{\rho}''(m_X, m_Y) = \inf_{x \in G_X,\, y \in G_Y} \limsup_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \rho_1(x_i, y_i).$$

Then

$$\bar{\rho}(m_X, m_Y) = \bar{\rho}''(m_X, m_Y);$$

that is, the $\bar{\rho}$ distance gives the minimum long term time average distortion
obtainable between generic sequences from the two sources.

(g) The infima defining $\bar{\rho}_n$ and $\bar{\rho}'$ are actually minima.



10.5 d-bar Continuous Channels 

We can now generalize some of the notions of channels by using the $\bar{\rho}$-distance
to weaken the definitions. The first definition is the most important for
channel coding applications. We now confine interest to the $\bar{d}$ distance, the
$\bar{\rho}$-distance for the special case of the Hamming distance:

$$\rho_1(x, y) = d_1(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{if } x \ne y. \end{cases}$$

Suppose that $[A, \nu, B]$ is a discrete alphabet channel and let $\nu_x^n$ denote the
restriction of the channel to $B^n$, that is, the output distribution on $Y^n$ given
an input sequence $x$. The channel is said to be $\bar{d}$-continuous if for any $\epsilon > 0$
there is an $n_0$ such that for all $n > n_0$, $\bar{d}_n(\nu_x^n, \nu_{x'}^n) \le \epsilon$ whenever $x_i = x'_i$ for
$i = 0, 1, \ldots, n - 1$. Alternatively, $\nu$ is $\bar{d}$-continuous if

$$\limsup_{n \to \infty} \sup_{a^n \in A^n} \sup_{x, x' \in c(a^n)} \bar{d}_n(\nu_x^n, \nu_{x'}^n) = 0,$$

where $c(a^n)$ is the rectangle defined as all $x$ with $x_i = a_i$; $i = 0, 1, \ldots, n - 1$.
$\bar{d}$-continuity implies the distributions on output $n$-tuples $Y^n$ given two input
sequences are very close provided that the input sequences are identical over the
same time period and that $n$ is large. This generalizes the notions of 0 or finite
input memory and anticipation since the distributions need only approximate
each other and do not have to be exactly the same.

More generally we could consider p-continuous channels in a similar manner, 
but we will focus on the simpler discrete d-continuous channel. 

d-continuous channels possess continuity properties that will be useful for 
proving block and sliding block coding theorems. They are “continuous” in the 
sense that knowing the input with sufficiently high probability for a sufficiently 
long time also specifies the output with high probability. The following two 
lemmas make these ideas precise. 

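Lemma 10.5.1: Suppose that $x, \bar{x} \in c(a^n)$ and

$$\bar{d}_n(\nu_x^n, \nu_{\bar{x}}^n) \le \delta^2.$$

This is the case, for example, if the channel is $\bar{d}$-continuous and $n$ is chosen
sufficiently large. Then

$$\nu_x^n(G_\delta) \ge \nu_{\bar{x}}^n(G) - \delta$$

and hence

$$\inf_{x \in c(a^n)} \nu_x^n(G_\delta) \ge \sup_{x \in c(a^n)} \nu_x^n(G) - \delta.$$

Proof: From Theorem 10.4.1 the infima defining the $\bar{d}$ distance are actually
minima and hence there is a pmf $p$ on $B^n \times B^n$ such that

$$\sum_{b^n \in B^n} p(y^n, b^n) = \nu_x^n(y^n)$$

and

$$\sum_{b^n \in B^n} p(b^n, y^n) = \nu_{\bar{x}}^n(y^n);$$

that is, $p$ has $\nu_x^n$ and $\nu_{\bar{x}}^n$ as marginals, and

$$\frac{1}{n} E_p d_n(Y^n, \bar{Y}^n) = \bar{d}_n(\nu_x^n, \nu_{\bar{x}}^n).$$

Using the Markov inequality we can write

$$\begin{aligned}
\nu_x^n(G_\delta) &= p(Y^n \in G_\delta) \\
&\ge p(\bar{Y}^n \in G \text{ and } d_n(Y^n, \bar{Y}^n) \le n\delta) = 1 - p(\bar{Y}^n \notin G \text{ or } d_n(Y^n, \bar{Y}^n) > n\delta) \\
&\ge 1 - p(\bar{Y}^n \notin G) - p(d_n(Y^n, \bar{Y}^n) > n\delta) \ge \nu_{\bar{x}}^n(G) - \frac{1}{\delta} E(n^{-1} d_n(Y^n, \bar{Y}^n)) \\
&\ge \nu_{\bar{x}}^n(G) - \delta,
\end{aligned}$$

proving the first statement. The second statement follows from the first. □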

Next suppose that $[G, \mu, U]$ is a stationary source, $f$ is a stationary encoder
which could correspond to a finite length sliding block encoder or to an infinite
length one, $\nu$ is a stationary channel, and $g$ is a length $m$ sliding block decoder.
The probability of error for the resulting hookup is defined by

$$P_e(\mu, \nu, f, g) = \Pr(\hat{U}_0 \ne U_0) = \mu\nu(E) = \int d\mu(u)\, \nu_{f(u)}(E_u),$$

where $E$ is the error event $\{u, y : u_0 \ne g_m(y_{-q}^m)\}$ and $E_u = \{y : (u, y) \in E\}$ is
the section of $E$ at $u$.

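Lemma 10.5.2: Given a stationary channel $\nu$, a stationary source $[G, \mu, U]$,
a length $m$ sliding block decoder, and two encoders $f$ and $\phi$, then for any positive
integer $r$

$$|P_e(\mu, \nu, f, g) - P_e(\mu, \nu, \phi, g)| \le \frac{m}{r} + r \Pr(f \ne \phi) + m \max_{a^r \in A^r} \sup_{x, x' \in c(a^r)} \bar{d}_r(\nu_x^r, \nu_{x'}^r).$$

Proof: Define $\Lambda = \{u : f(u) = \phi(u)\}$ and

$$\Lambda_r = \{u : f(T^i u) = \phi(T^i u);\ i = 0, 1, \ldots, r - 1\} = \bigcap_{i=0}^{r-1} T^{-i} \Lambda.$$

From the union bound

$$\mu(\Lambda_r^c) \le r \mu(\Lambda^c) = r \Pr(f \ne \phi). \tag{10.12}$$

From stationarity, if $g = g_m(Y_{-q}^m)$ then

$$\begin{aligned}
P_e(\mu, \nu, f, g) &= \int d\mu(u)\, \nu_{f(u)}(y : g_m(y_{-q}^m) \ne u_0) \\
&= \frac{1}{r} \sum_{i=0}^{r-1} \int d\mu(u)\, \nu_{f(u)}(y : g_m(y_{i-q}^m) \ne u_i) \\
&\le \frac{m}{r} + \frac{1}{r} \sum_{i=q}^{r-q} \int_{\Lambda_r} d\mu(u)\, \nu_{f(u)}^r(y^r : g_m(y_{i-q}^m) \ne u_i) + \mu(\Lambda_r^c).
\end{aligned} \tag{10.13}$$

Fix $u \in \Lambda_r$ and let $p_u$ yield $\bar{d}_r(\nu_{f(u)}^r, \nu_{\phi(u)}^r)$; that is, $\sum_{w^r} p_u(y^r, w^r) = \nu_{f(u)}^r(y^r)$,
$\sum_{y^r} p_u(y^r, w^r) = \nu_{\phi(u)}^r(w^r)$, and

$$\frac{1}{r} \sum_{i=0}^{r-1} p_u(y^r, w^r : y_i \ne w_i) = \bar{d}_r(\nu_{f(u)}^r, \nu_{\phi(u)}^r). \tag{10.14}$$

We have that

$$\begin{aligned}
\frac{1}{r} \sum_{i=q}^{r-q} \nu_{f(u)}^r(y^r : g_m(y_{i-q}^m) \ne u_i) &= \frac{1}{r} \sum_{i=q}^{r-q} p_u(y^r, w^r : g_m(y_{i-q}^m) \ne u_i) \\
&\le \frac{1}{r} \sum_{i=q}^{r-q} p_u(y^r, w^r : g_m(y_{i-q}^m) \ne g_m(w_{i-q}^m)) + \frac{1}{r} \sum_{i=q}^{r-q} p_u(y^r, w^r : g_m(w_{i-q}^m) \ne u_i) \\
&\le \frac{1}{r} \sum_{i=q}^{r-q} p_u(y^r, w^r : y_{i-q}^m \ne w_{i-q}^m) + P_e(\mu, \nu, \phi, g) \\
&\le \frac{1}{r} \sum_{i=q}^{r-q} \sum_{j=i-q}^{i-q+m} p_u(y^r, w^r : y_j \ne w_j) + P_e(\mu, \nu, \phi, g) \\
&\le m\, \bar{d}_r(\nu_{f(u)}^r, \nu_{\phi(u)}^r) + P_e(\mu, \nu, \phi, g),
\end{aligned}$$

which with (10.12)-(10.14) proves the lemma. □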

The following corollary states that the probability of error using sliding block 
codes over a d-continuous channel is a continuous function of the encoder as 
measured by the metric on encoders given by the probability of disagreement of 
the outputs of two encoders. 

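Corollary 10.5.1: Given a stationary $\bar{d}$-continuous channel $\nu$ and a finite
length decoder $g_m : B^m \to A$, then given $\epsilon > 0$ there is a $\delta > 0$ so that if $f$ and
$\phi$ are two stationary encoders such that $\Pr(f \ne \phi) \le \delta$, then

$$|P_e(\mu, \nu, f, g) - P_e(\mu, \nu, \phi, g)| \le \epsilon.$$

Proof: Fix $\epsilon > 0$ and choose $r$ so large that

$$\max_{a^r} \sup_{x, x' \in c(a^r)} \bar{d}_r(\nu_x^r, \nu_{x'}^r) \le \frac{\epsilon}{3m},$$

$$\frac{m}{r} \le \frac{\epsilon}{3},$$

and choose $\delta = \epsilon/(3r)$. Then each of the three terms in the bound of Lemma 10.5.2
is at most $\epsilon/3$, so that

$$|P_e(\mu, \nu, f, g) - P_e(\mu, \nu, \phi, g)| \le \epsilon. \quad □$$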

Given an arbitrary channel $[A, \nu, B]$, we can define for any block length
$N$ a closely related CBI channel $[A, \tilde{\nu}, B]$ as the CBI channel with the same
probabilities on output $N$-blocks, that is, the same conditional probabilities for
$Y_{kN}^N$ given $x$, but having conditionally independent blocks. We shall call $\tilde{\nu}$ the
$N$-CBI approximation to $\nu$. A channel $\nu$ is said to be conditionally almost block
independent or CABI if given $\epsilon$ there is an $N_0$ such that for any $N \ge N_0$ there
is an $M_0$ such that for any $x$ and any $N$-CBI approximation $\tilde{\nu}$ to $\nu$

$$\bar{d}(\tilde{\nu}_x^M, \nu_x^M) \le \epsilon, \quad \text{all } M \ge M_0,$$

where $\nu_x^M$ denotes the restriction of $\nu_x$ to $\mathcal{B}_B^M$, that is, the output distribution on
$Y^M$ given $x$. A CABI channel is one such that the output distribution is close (in
a $\bar{d}$ sense) to that of the $N$-CBI approximation provided that $N$ is big enough.
CABI channels were introduced by Neuhoff and Shields [110] who provided
several examples and alternative characterizations of the class. In particular they
showed that finite memory channels are both $\bar{d}$-continuous and CABI. Their
principal result, however, requires the notion of the $\bar{d}$ distance between channels.
Given two channels $[A, \nu, B]$ and $[A, \nu', B]$, define the $\bar{d}$ distance between the
channels to be

$$\bar{d}(\nu, \nu') = \limsup_{n \to \infty} \sup_x \bar{d}_n(\nu_x^n, {\nu'}_x^n).$$

Neuhoff and Shields [110] showed that the class of CABI channels is exactly
the class of primitive channels together with the $\bar{d}$ limits of such channels.

10.6 The Distortion-Rate Function 

We close this chapter on distortion, approximation, and performance with the 
introduction and discussion of Shannon’s distortion-rate function. This function 
(or functional) of the source and distortion measure will play a fundamental role 
in evaluating the OPTA functions. In fact, it can be considered as a form of 
information theoretic OPTA. Suppose now that we are given a source $[A, \mu]$
and a fidelity criterion $\rho_n$; $n = 1, 2, \ldots$ defined on $A \times \hat{A}$, where $\hat{A}$ is called
the reproduction alphabet. Then the Shannon distortion rate function (DRF) is
defined in terms of a nonnegative parameter called rate by

$$D(R, \mu) = \limsup_{N \to \infty} \frac{1}{N} D_N(R, \mu^N)$$

where

$$D_N(R, \mu^N) = \inf_{p^N \in \mathcal{R}_N(R, \mu^N)} E_{p^N} \rho_N(X^N, Y^N),$$

where $\mathcal{R}_N(R, \mu^N)$ is the collection of all distributions $p^N$ for the coordinate
random vectors $X^N$ and $Y^N$ on the space $(A^N \times \hat{A}^N, \mathcal{B}_A^N \times \mathcal{B}_{\hat{A}}^N)$ with the
properties that

(1) $p^N$ induces the given marginal $\mu^N$; that is, $p^N(F \times \hat{A}^N) = \mu^N(F)$ for all
$F \in \mathcal{B}_A^N$, and

(2) the mutual information satisfies

$$\frac{1}{N} I_{p^N}(X^N; Y^N) \le R.$$

If $\mathcal{R}_N(R, \mu^N)$ is empty, then $D_N(R, \mu^N)$ is $\infty$. $D_N$ is called the $N$th order
distortion-rate function.
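
For finite alphabets and $N = 1$, points on the first-order distortion-rate curve can be computed numerically. The sketch below uses the Blahut-Arimoto iteration, a standard algorithm that is not part of the text's development. It is parametrized by a slope parameter `beta` rather than by the rate, and it returns a pair (rate in nats, distortion) lying on the curve once the iteration has converged. The function name, the list representation of the source pmf and distortion matrix, and the fixed iteration count are all assumptions made for the example.

```python
import math


def blahut_arimoto(source, distortion, beta, iterations=2000):
    """Return (rate_nats, distortion) for one point of the first-order curve.

    source     : list of input probabilities mu(x)
    distortion : matrix rho[x][y] of per-letter distortions
    beta       : positive slope parameter; larger beta gives smaller distortion
    """
    nx, ny = len(source), len(distortion[0])
    q = [1.0 / ny] * ny  # output marginal, initialised uniform
    cond = [[1.0 / ny] * ny for _ in range(nx)]
    for _ in range(iterations):
        # Test channel proportional to q(y) * exp(-beta * rho(x, y)).
        for x in range(nx):
            w = [q[y] * math.exp(-beta * distortion[x][y]) for y in range(ny)]
            total = sum(w)
            cond[x] = [v / total for v in w]
        # Output marginal induced by the source and the test channel.
        q = [sum(source[x] * cond[x][y] for x in range(nx)) for y in range(ny)]
    avg_dist = sum(source[x] * cond[x][y] * distortion[x][y]
                   for x in range(nx) for y in range(ny))
    rate = sum(source[x] * cond[x][y] * math.log(cond[x][y] / q[y])
               for x in range(nx) for y in range(ny) if cond[x][y] > 0.0)
    return rate, avg_dist
```

For a binary source with the Hamming distortion the result can be compared with the well-known closed form $R = h(\mu(1)) - h(D)$ (binary entropies in nats), valid for $0 \le D \le \min(\mu(1), 1 - \mu(1))$. Sweeping `beta` traces out the convex curve discussed in the lemma that follows.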

Lemma 10.6.1: $D_N(R, \mu)$ and $D(R, \mu)$ are nonnegative convex $\cup$ functions
of $R$ and hence are continuous in $R$ for $R > 0$.

Proof: Nonnegativity is obvious from the nonnegativity of distortion.
Suppose that $p_i \in \mathcal{R}_N(R_i, \mu^N)$; $i = 1, 2$ yields

$$E_{p_i} \rho_N(X^N, Y^N) \le D_N(R_i, \mu) + \epsilon.$$

From Corollary 5.5.5 mutual information is a convex $\cup$ function of the
conditional distribution and hence if $\bar{p} = \lambda p_1 + (1 - \lambda) p_2$, then

$$\frac{1}{N} I_{\bar{p}} \le \lambda \frac{1}{N} I_{p_1} + (1 - \lambda) \frac{1}{N} I_{p_2} \le \lambda R_1 + (1 - \lambda) R_2$$

and hence $\bar{p} \in \mathcal{R}_N(\lambda R_1 + (1 - \lambda) R_2)$ and therefore

$$\begin{aligned}
D_N(\lambda R_1 + (1 - \lambda) R_2) &\le E_{\bar{p}} \rho_N(X^N, Y^N) \\
&= \lambda E_{p_1} \rho_N(X^N, Y^N) + (1 - \lambda) E_{p_2} \rho_N(X^N, Y^N) \\
&\le \lambda D_N(R_1, \mu) + (1 - \lambda) D_N(R_2, \mu) + \epsilon.
\end{aligned}$$

Since $\epsilon$ is arbitrary, $D_N$ is convex. Since $D(R, \mu)$ is the limit of $N^{-1} D_N(R, \mu)$, it
too is convex. It is well known from real analysis that convex functions are
continuous except possibly at their end points. □

The following lemma shows that when the underlying source is stationary
and the fidelity criterion is subadditive (e.g., additive), then the limit defining
$D(R, \mu)$ is an infimum.

Lemma 10.6.2: If the source $\mu$ is stationary and the fidelity criterion is
subadditive, then

$$D(R, \mu) = \lim_{N \to \infty} \frac{1}{N} D_N(R, \mu) = \inf_N \frac{1}{N} D_N(R, \mu).$$

Proof: Fix $N$ and $n < N$ and let $p^n \in \mathcal{R}_n(R, \mu^n)$ yield

$$E_{p^n} \rho_n(X^n, Y^n) \le D_n(R, \mu^n) + \frac{\epsilon}{2}$$

and let $p^{N-n} \in \mathcal{R}_{N-n}(R, \mu^{N-n})$ yield

$$E_{p^{N-n}} \rho_{N-n}(X^{N-n}, Y^{N-n}) \le D_{N-n}(R, \mu^{N-n}) + \frac{\epsilon}{2}.$$

$p^n$ together with $\mu^n$ implies a regular conditional probability $q(F|x^n)$, $F \in \mathcal{B}_{\hat{A}}^n$.
Similarly $p^{N-n}$ and $\mu^{N-n}$ imply a regular conditional probability $r(G|x^{N-n})$.
Define now a regular conditional probability $t(\cdot|x^N)$ by its values on rectangles
as

$$t(F \times G | x^N) = q(F|x^n)\, r(G|x_n^{N-n}); \quad F \in \mathcal{B}_{\hat{A}}^n,\ G \in \mathcal{B}_{\hat{A}}^{N-n}.$$

Note that this is the finite dimensional analog of a block memoryless channel
with two blocks. Let $p^N = \mu^N t$ be the distribution induced by $\mu$ and $t$. Then
exactly as in Lemma 9.4.2 we have because of the conditional independence that

$$I_{p^N}(X^N; Y^N) \le I_{p^N}(X^n; Y^n) + I_{p^N}(X_n^{N-n}; Y_n^{N-n})$$

and hence from stationarity

$$I_{p^N}(X^N; Y^N) \le I_{p^n}(X^n; Y^n) + I_{p^{N-n}}(X^{N-n}; Y^{N-n}) \le nR + (N - n)R = NR$$

so that $p^N \in \mathcal{R}_N(R, \mu^N)$. Thus

$$\begin{aligned}
D_N(R, \mu^N) &\le E_{p^N} \rho_N(X^N, Y^N) \le E_{p^N} \left( \rho_n(X^n, Y^n) + \rho_{N-n}(X_n^{N-n}, Y_n^{N-n}) \right) \\
&= E_{p^n} \rho_n(X^n, Y^n) + E_{p^{N-n}} \rho_{N-n}(X^{N-n}, Y^{N-n}) \\
&\le D_n(R, \mu^n) + D_{N-n}(R, \mu^{N-n}) + \epsilon.
\end{aligned}$$

Thus since $\epsilon$ is arbitrary we have shown that if $d_n = D_n(R, \mu^n)$, then

$$d_N \le d_n + d_{N-n}; \quad n \le N;$$

that is, the sequence $d_n$ is subadditive. The lemma then follows immediately
from Lemma 7.5.1 of [50]. □
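
The final step of the proof uses the general fact that for a subadditive sequence $d_n$ the normalized values $d_n / n$ converge to $\inf_n d_n / n$ (often called Fekete's lemma). The sketch below illustrates this with a hypothetical sequence $d_n = 2n + \sqrt{n}$, chosen only because it is easy to verify by hand. It is not a distortion-rate function.

```python
import math


def d(n: int) -> float:
    """A hypothetical subadditive sequence: d_n = 2 n + sqrt(n)."""
    return 2.0 * n + math.sqrt(n)


def check_subadditive(seq, max_n: int, tol: float = 1e-9) -> bool:
    """Verify seq(N) <= seq(n) + seq(N - n) for all 1 <= n < N <= max_n."""
    return all(seq(N) <= seq(n) + seq(N - n) + tol
               for N in range(2, max_n + 1) for n in range(1, N))


# Normalized values d_n / n decrease toward the infimum, which is 2 here.
normalized = [d(n) / n for n in (1, 10, 100, 10_000, 1_000_000)]
```

In this example $d_n / n = 2 + n^{-1/2}$, so the infimum 2 is approached but never attained, which matches the form of the limit asserted in Lemma 10.6.2.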

As with the $\bar{\rho}$ distance, there are alternative characterizations of the
distortion-rate function when the process is stationary. The remainder of this section is
devoted to developing these results. The idea of an SBM channel will play
an important role in relating $n$th order distortion-rate functions to the process
definitions. We henceforth assume that the input source $\mu$ is stationary and
we confine interest to additive fidelity criteria based on a per-letter distortion
$\rho = \rho_1$.

The basic process DRF is defined by

$$\bar{D}_s(R, \mu) = \inf_{p \in \bar{\mathcal{R}}_s(R, \mu)} E_p \rho(X_0, Y_0),$$

where $\bar{\mathcal{R}}_s(R, \mu)$ is the collection of all stationary processes $p$ having $\mu$ as an
input distribution and having mutual information rate $\bar{I}_p = \bar{I}_p(X; Y) \le R$. The
original idea of a process rate-distortion function was due to Kolmogorov and
his colleagues [87] [45] (see also [23]). The idea was later elaborated by Marton
[101] and Gray, Neuhoff, and Omura [55].

Recalling that the $L^1$ ergodic theorem for information density holds when
$\bar{I}_p = I_p^*$; that is, the two principal definitions of mutual information rate yield
the same value, we also define the process DRF

$$D_s^*(R, \mu) = \inf_{p \in \mathcal{R}_s^*(R, \mu)} E_p \rho(X_0, Y_0),$$

where $\mathcal{R}_s^*(R, \mu)$ is the collection of all stationary processes $p$ having $\mu$ as an
input distribution, having mutual information rate $\bar{I}_p \le R$, and having $\bar{I}_p = I_p^*$.

If $\mu$ is both stationary and ergodic, define the corresponding ergodic process
DRF's by

$$\bar{D}_e(R, \mu) = \inf_{p \in \bar{\mathcal{R}}_e(R, \mu)} E_p \rho(X_0, Y_0),$$

$$D_e^*(R, \mu) = \inf_{p \in \mathcal{R}_e^*(R, \mu)} E_p \rho(X_0, Y_0),$$

where $\bar{\mathcal{R}}_e(R, \mu)$ is the subset of $\bar{\mathcal{R}}_s(R, \mu)$ containing only ergodic measures and
$\mathcal{R}_e^*(R, \mu)$ is the subset of $\mathcal{R}_s^*(R, \mu)$ containing only ergodic measures.

Theorem 10.6.1: Suppose that we are given a stationary source which possesses
a reference letter in the sense that there exists a letter $a^* \in \hat{A}$ such that

$$E_\mu \rho(X_0, a^*) \le \rho^* < \infty. \tag{10.15}$$

Fix $R > 0$. If $D(R, \mu) < \infty$, then

$$D(R, \mu) = \bar{D}_s(R, \mu) = D_s^*(R, \mu).$$

If in addition $\mu$ is ergodic, then also

$$D(R, \mu) = \bar{D}_e(R, \mu) = D_e^*(R, \mu).$$

The proof of the theorem depends strongly on the relations among distortion 
and mutual information for vectors and for SBM channels. These are stated 
and proved in the following lemma, the proof of which is straightforward but 
somewhat tedious. The theorem is proved after the lemma. 

Lemma 10.6.3: Let $\mu$ be the process distribution of a stationary source
$\{X_n\}$. Let $\rho_n$; $n = 1, 2, \ldots$ be a subadditive (e.g., additive) fidelity criterion.
Suppose that there is a reference letter $a^* \in \hat{A}$ for which (10.15) holds. Let $p^N$ be
a measure on $(A^N \times \hat{A}^N, \mathcal{B}_A^N \times \mathcal{B}_{\hat{A}}^N)$ having $\mu^N$ as input marginal; that is,
$p^N(F \times \hat{A}^N) = \mu^N(F)$ for $F \in \mathcal{B}_A^N$. Let $q$ denote the induced conditional probability
measure; that is, $q_{x^N}(F)$, $x^N \in A^N$, $F \in \mathcal{B}_{\hat{A}}^N$, is a regular conditional probability
measure. (This exists because the spaces are standard.) We abbreviate this
relationship as $p^N = \mu^N q$. Let $X^N, Y^N$ denote the coordinate functions on
$A^N \times \hat{A}^N$ and suppose that

$$E_{p^N} \frac{1}{N} \rho_N(X^N, Y^N) \le D \tag{10.16}$$

and

$$\frac{1}{N} I_{p^N}(X^N; Y^N) \le R. \tag{10.17}$$

If $\nu$ is an $(N, \delta)$ SBM channel induced by $q$ as in Example 9.4.11 and if $p = \mu\nu$
is the resulting hookup and $\{X_n, Y_n\}$ the input/output pair process, then

$$\frac{1}{N} E_p \rho_N(X^N, Y^N) \le D + \rho^* \delta \tag{10.18}$$

and

$$\bar{I}_p(X; Y) = I_p^*(X; Y) \le R; \tag{10.19}$$

that is, the resulting mutual information rate of the induced stationary process
satisfies the same inequality as the vector mutual information and the resulting
distortion approximately satisfies the vector inequality provided $\delta$ is sufficiently
small. Observe that if the fidelity criterion is additive, then (10.18) becomes

$$E_p \rho_1(X_0, Y_0) \le D + \rho^* \delta.$$
Proof: We first consider the distortion as it is easier to handle. Since the
SBM channel is stationary and the source is stationary, the hookup $p$ is
stationary and

$$\frac{1}{n} E_p \rho_n(X^n, Y^n) = \frac{1}{n} \int dm_Z(z)\, E_{p_z} \rho_n(X^n, Y^n),$$

where $p_z$ is the conditional distribution of $\{X_n, Y_n\}$ given $\{Z_n\}$. Note that the
above formula reduces to $E_p \rho(X_0, Y_0)$ if the fidelity criterion is additive because
of the stationarity. Given $z$, define $J_0^n(z)$ to be the collection of indices of $z^n$ for
which $z_i$ is not in an $N$-cell. (See the discussion in Example 9.4.11.) Let $J_1^n(z)$
be the collection of indices for which $z_i$ begins an $N$-cell. If we define the event
$G = \{z : z_0 \text{ begins an } N\text{-cell}\}$, then $i \in J_1^n(z)$ if $T^i z \in G$. From Corollary
9.4.3 $m_Z(G) \le N^{-1}$. Since $\mu$ is stationary and $\{X_n\}$ and $\{Z_n\}$ are mutually
independent,

$$\begin{aligned}
n E_{p_z} \rho_n(X^n, Y^n) &\le \sum_{i \in J_0^n(z)} E_{p_z} \rho(X_i, a^*) + N \sum_{i \in J_1^n(z)} E_{p_z} \rho(X_i^N, Y_i^N) \\
&= \sum_{i=0}^{n-1} 1_{G^c}(T^i z) \rho^* + \sum_{i=0}^{n-1} E_{p^N} \rho_N\, 1_G(T^i z).
\end{aligned}$$

Since $m_Z$ is stationary, integrating the above we have that

$$E_p \rho_1(X_0, Y_0) = \rho^* m_Z(G^c) + N m_Z(G) E_{p^N} \rho_N \le \rho^* \delta + E_{p^N} \rho_N,$$

proving (10.18).

Let $r_m$ and $t_m$ denote asymptotically accurate quantizers on $A$ and $\hat{A}$; that
is, as in Corollary 6.2.1 define

$$\hat{X}^n = r_m(X)^n = (r_m(X_0), \ldots, r_m(X_{n-1}))$$

and similarly define $\hat{Y}^n = t_m(Y)^n$. Then

$$I(r_m(X)^n; t_m(Y)^n) \to I(X^n; Y^n) \quad \text{as } m \to \infty$$

and

$$\bar{I}(r_m(X); t_m(Y)) \to I^*(X; Y) \quad \text{as } m \to \infty.$$

We wish to prove that

$$\begin{aligned}
\bar{I}(X; Y) &= \lim_{n \to \infty} \lim_{m \to \infty} \frac{1}{n} I(r_m(X)^n; t_m(Y)^n) \\
&= \lim_{m \to \infty} \lim_{n \to \infty} \frac{1}{n} I(r_m(X)^n; t_m(Y)^n) = I^*(X; Y).
\end{aligned}$$

Since $\bar{I} \ge I^*$, we must show that

$$\lim_{n \to \infty} \lim_{m \to \infty} \frac{1}{n} I(r_m(X)^n; t_m(Y)^n) \le \lim_{m \to \infty} \lim_{n \to \infty} \frac{1}{n} I(r_m(X)^n; t_m(Y)^n).$$

We have that

$$I(\hat{X}^n; \hat{Y}^n) = I((\hat{X}^n, Z^n); \hat{Y}^n) - I(Z^n; \hat{Y}^n | \hat{X}^n)$$

and

$$I((\hat{X}^n, Z^n); \hat{Y}^n) = I(\hat{X}^n; \hat{Y}^n | Z^n) + I(\hat{Y}^n; Z^n) = I(\hat{X}^n; \hat{Y}^n | Z^n)$$

since $\hat{X}^n$ and $Z^n$ are independent. Similarly,

$$\begin{aligned}
I(Z^n; \hat{Y}^n | \hat{X}^n) &= H(Z^n | \hat{X}^n) - H(Z^n | \hat{X}^n, \hat{Y}^n) \\
&= H(Z^n) - H(Z^n | \hat{X}^n, \hat{Y}^n) = I(Z^n; (\hat{X}^n, \hat{Y}^n)).
\end{aligned}$$

Thus we need to show that

$$\begin{aligned}
&\lim_{n \to \infty} \lim_{m \to \infty} \left( \frac{1}{n} I(r_m(X)^n; t_m(Y)^n | Z^n) - \frac{1}{n} I(Z^n; (r_m(X)^n, t_m(Y)^n)) \right) \\
&\quad \le \lim_{m \to \infty} \lim_{n \to \infty} \left( \frac{1}{n} I(r_m(X)^n; t_m(Y)^n | Z^n) - \frac{1}{n} I(Z^n; (r_m(X)^n, t_m(Y)^n)) \right).
\end{aligned}$$

Since $Z^n$ has a finite alphabet, the limits of $n^{-1} I(Z^n; (r_m(X)^n, t_m(Y)^n))$ are
the same regardless of the order from Theorem 6.4.1. Thus $\bar{I}$ will equal $I^*$ if we
can show that

$$\begin{aligned}
\bar{I}(X; Y | Z) &= \lim_{n \to \infty} \lim_{m \to \infty} \frac{1}{n} I(r_m(X)^n; t_m(Y)^n | Z^n) \\
&\le \lim_{m \to \infty} \lim_{n \to \infty} \frac{1}{n} I(r_m(X)^n; t_m(Y)^n | Z^n) = I^*(X; Y | Z).
\end{aligned} \tag{10.20}$$

This we now proceed to do. From Lemma 5.5.7 we can write

$$I(r_m(X)^n; t_m(Y)^n | Z^n) = \int I(r_m(X)^n; t_m(Y)^n | Z^n = z^n)\, dP_{Z^n}(z^n).$$

Abbreviate $I(r_m(X)^n; t_m(Y)^n | Z^n = z^n)$ to $I_z(\hat{X}^n; \hat{Y}^n)$. This is simply the
mutual information between $\hat{X}^n$ and $\hat{Y}^n$ under the distribution for $(\hat{X}^n, \hat{Y}^n)$
given a particular random blocking sequence $z$. We have that

$$I_z(\hat{X}^n; \hat{Y}^n) = H_z(\hat{Y}^n) - H_z(\hat{Y}^n | \hat{X}^n).$$



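Given $z$, let $J_0^n(z)$ be as before. Let $J_2^n(z)$ denote the collection of all indices $i$
of $z_i$ for which $z_i$ begins an $N$-cell except for the final such index (which may
begin an $N$-cell not completed within $z^n$). Thus $J_2^n(z)$ is the same as $J_1^n(z)$
except that the largest index in the latter collection may have been removed
if the resulting $N$-cell was not completed within the $n$-tuple. We have using
standard entropy relations that

$$\begin{aligned}
I_z(\hat{X}^n; \hat{Y}^n) &\ge \sum_{i \in J_0^n(z)} \left( H_z(\hat{Y}_i | \hat{Y}^i) - H_z(\hat{Y}_i | \hat{Y}^i, \hat{X}^{i+1}) \right) \\
&\quad + \sum_{i \in J_2^n(z)} \left( H_z(\hat{Y}_i^N | \hat{Y}^i) - H_z(\hat{Y}_i^N | \hat{Y}^i, \hat{X}^{i+N}) \right).
\end{aligned} \tag{10.21}$$

For $i \in J_0^n(z)$, however, $Y_i$ is $a^*$ with probability one and hence

$$H_z(\hat{Y}_i | \hat{Y}^i) \le H_z(\hat{Y}_i) \le H_z(Y_i) = 0$$

and

$$H_z(\hat{Y}_i | \hat{Y}^i, \hat{X}^{i+1}) \le H_z(\hat{Y}_i) \le H_z(Y_i) = 0.$$

Thus we have the bound

$$\begin{aligned}
I_z(\hat{X}^n; \hat{Y}^n) &\ge \sum_{i \in J_2^n(z)} \left( H_z(\hat{Y}_i^N | \hat{Y}^i) - H_z(\hat{Y}_i^N | \hat{Y}^i, \hat{X}^{i+N}) \right) \\
&= \sum_{i \in J_2^n(z)} \left( I_z(\hat{Y}_i^N; (\hat{Y}^i, \hat{X}^{i+N})) - I_z(\hat{Y}_i^N; \hat{Y}^i) \right) \\
&\ge \sum_{i \in J_2^n(z)} \left( I_z(\hat{Y}_i^N; \hat{X}_i^N) - I_z(\hat{Y}_i^N; \hat{Y}^i) \right),
\end{aligned} \tag{10.22}$$

where the last inequality follows from the fact that $I(U; (V, W)) \ge I(U; V)$.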

For $i \in J_2^n(z)$ we have by construction and the stationarity of $\mu$ that

$$I_z(\hat{X}_i^N; \hat{Y}_i^N) = I_{p^N}(\hat{X}^N; \hat{Y}^N). \tag{10.23}$$

As before let $G = \{z : z_0 \text{ begins an } N\text{-cell}\}$. Then $i \in J_2^n(z)$ if $T^i z \in G$ and
$i < n - N$ and we can write

$$\frac{1}{n} I_z(\hat{X}^n; \hat{Y}^n) \ge \frac{1}{n} I_{p^N}(\hat{X}^N; \hat{Y}^N) \sum_{i=0}^{n-N-1} 1_G(T^i z) - \frac{1}{n} \sum_{i=0}^{n-N-1} I_z(\hat{Y}_i^N; \hat{Y}^i)\, 1_G(T^i z).$$

All of the above terms are measurable functions of $z$ and are nonnegative. Hence
they are integrable (although we do not yet know if the integral is finite) and
we have that

$$\frac{1}{n} I(\hat{X}^n; \hat{Y}^n | Z^n) \ge I_{p^N}(\hat{X}^N; \hat{Y}^N)\, m_Z(G)\, \frac{n - N}{n} - \frac{1}{n} \sum_{i=0}^{n-N-1} \int dm_Z(z)\, I_z(\hat{Y}_i^N; \hat{Y}^i)\, 1_G(T^i z).$$
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To continue we use the fact that since the processes are stationary, we can 
consider it to be a two sided process (if it is one sided, we can imbed it in a two 
sided process with the same probabilities on rectangles). By construction 

n = I T ‘ Z (Y 0 N ; (y_i, • • • , y_i)) 

and hence since m z is stationary we can change variables to obtain 

1 A A A A in 

-I{X n -Y n ) > I p n (X N ; Y N )m z (G) 

n n 

.j n—N—l » 

— E dm z {z)I z {Y 0 N -,{Y_ l ,---,Y_ 1 ))l G {z). 

Tl n J 

i—O 

We obtain a further bound from the inequalities

$$I_z(Y_0^N;(Y_{-i},\ldots,Y_{-1})) \le I_z(Y_0^N;(Y_{-n},\ldots,Y_{-1})) \le I_z(Y_0^N;Y^-)$$

where $Y^- = (\ldots,Y_{-2},Y_{-1})$. Since $I_z(Y_0^N;Y^-)$ is measurable and nonnegative,
its integral is defined and hence

$$\liminf_{n\to\infty} \frac{1}{n} I(X^n;Y^n|Z^n) \ge I_{p^N}(X^N;Y^N)\, m_Z(G) - \int_G dm_Z(z)\, I_z(Y_0^N;Y^-).$$

We can now take the limit as $m \to \infty$ to obtain

$$I^*(X;Y|Z) \ge I_{p^N}(X^N;Y^N)\, m_Z(G) - \int_G dm_Z(z)\, I_z(Y_0^N;Y^-). \tag{10.24}$$

This provides half of what we need.

Analogous to (10.21) we have the upper bound

$$I_z(X^n;Y^n) \le \sum_{i \in J_1^n(z)} \left( I_z(Y_i^N;(Y^i,X^{i+N})) - I_z(Y_i^N;Y^i) \right). \tag{10.25}$$

We note in passing that the use of $J_1$ here assumes that we are dealing with a
one sided channel and hence there is no contribution to the information from
any initial symbols not contained in the first $N$-cell. In the two sided case time
0 could occur in the middle of an $N$-cell and one could fix the upper bound by
adding the first index less than 0 for which $z_i$ begins an $N$-cell to the above
sum. This term has no effect on the limits. Taking the limits as $m \to \infty$ using
Lemma 5.5.1 we have that

$$I_z(X^n;Y^n) \le \sum_{i \in J_1^n(z)} \left( I_z(Y_i^N;(Y^i,X^{i+N})) - I_z(Y_i^N;Y^i) \right).$$

Given $Z^n = z^n$ and $i \in J_1^n(z)$, $(X^i,Y^i) \to X_i^N \to Y_i^N$ forms a Markov chain
because of the conditional independence and hence from Lemma 5.5.2 and Corollary 5.5.3

$$I_z(Y_i^N;(Y^i,X^{i+N})) = I_z(X_i^N;Y_i^N) = I_{p^N}(X^N;Y^N).$$

Thus we have the upper bound

$$\frac{1}{n} I_z(X^n;Y^n) \le \frac{1}{n} I_{p^N}(X^N;Y^N) \sum_{i=0}^{n-1} 1_G(T^i z) - \frac{1}{n} \sum_{i=0}^{n-1} I_z(Y_i^N;Y^i)\, 1_G(T^i z).$$

Taking expectations and using stationarity as before we find that

$$\frac{1}{n} I(X^n;Y^n|Z^n) \le I_{p^N}(X^N;Y^N)\, m_Z(G) - \frac{1}{n} \sum_{i=0}^{n-1} \int_G dm_Z(z)\, I_z(Y_0^N;(Y_{-i},\ldots,Y_{-1})).$$

Taking the limit as $n \to \infty$ using Lemma 5.6.1 yields

$$\bar{I}(X;Y|Z) \le I_{p^N}(X^N;Y^N)\, m_Z(G) - \int_G dm_Z(z)\, I_z(Y_0^N;Y^-). \tag{10.26}$$



Combining this with (10.24) proves that

$$\bar{I}(X;Y|Z) \le I^*(X;Y|Z)$$

and hence that

$$\bar{I}(X;Y) = I^*(X;Y).$$

It also proves that

$$\bar{I}(X;Y) = \bar{I}(X;Y|Z) - \bar{I}(Z;(X,Y)) \le \bar{I}(X;Y|Z) \le I_{p^N}(X^N;Y^N)\, m_Z(G) \le \frac{1}{N} I_{p^N}(X^N;Y^N)$$

using Corollary 9.4.3 to bound $m_Z(G)$. This proves (10.19). □
Proof of the theorem: We have immediately that

$$\mathcal{R}_e^*(R,\mu) \subset \mathcal{R}_s^*(R,\mu) \subset \mathcal{R}_s(R,\mu)$$

and

$$\mathcal{R}_e^*(R,\mu) \subset \mathcal{R}_e(R,\mu) \subset \mathcal{R}_s(R,\mu),$$

and hence we have for stationary sources that

$$D_s(R,\mu) \le D_s^*(R,\mu) \tag{10.27}$$

and for ergodic sources that

$$D_s(R,\mu) \le D_s^*(R,\mu) \le D_e^*(R,\mu) \tag{10.28}$$

and

$$D_s(R,\mu) \le D_e(R,\mu) \le D_e^*(R,\mu). \tag{10.29}$$

We next prove that

$$D_s(R,\mu) \ge D(R,\mu). \tag{10.30}$$

If $D_s(R,\mu)$ is infinite, the inequality is obvious. Otherwise fix $\epsilon > 0$ and choose
a $p \in \mathcal{R}_s(R,\mu)$ for which $E_p \rho_1(X_0,Y_0) \le D_s(R,\mu) + \epsilon$ and fix $\delta > 0$ and choose
$n_0$ so large that for $n \ge n_0$ we have that

$$n^{-1} I_p(X^n;Y^n) \le \bar{I}_p(X;Y) + \delta \le R + \delta.$$

For $n \ge n_0$ we therefore have that $p^n \in \mathcal{R}_n(R+\delta,\mu^n)$ and hence

$$D_s(R,\mu) + \epsilon \ge E_{p^n}\rho_n \ge D_n(R+\delta,\mu) \ge D(R+\delta,\mu).$$

From Lemma 10.6.1 $D(R,\mu)$ is continuous in $R$ and hence (10.30) is proved.
Lastly, fix $\epsilon > 0$ and choose $N$ so large and $p^N \in \mathcal{R}_N(R,\mu^N)$ so that

$$E_{p^N}\rho_N \le D_N(R,\mu^N) + \frac{\epsilon}{3} \le D(R,\mu) + \frac{2\epsilon}{3}.$$

Construct the corresponding $(N,\delta)$-SBM channel as in Example 9.4.11 with $\delta$
small enough to ensure that $\delta\rho^* \le \epsilon/3$. Then from Lemma 10.6.2 we have
that the resulting hookup $p$ is stationary and that $\bar{I}_p = I_p^* \le R$ and hence
$p \in \mathcal{R}_s^*(R,\mu) \subset \mathcal{R}_s(R,\mu)$. Furthermore, if $\mu$ is ergodic then so is $p$ and hence
$p \in \mathcal{R}_e^*(R,\mu) \subset \mathcal{R}_e(R,\mu)$. From Lemma 10.6.2 the resulting distortion is

$$E_p \rho_1(X_0,Y_0) \le E_{p^N}\rho_N + \rho^*\delta \le D(R,\mu) + \epsilon.$$

Since $\epsilon > 0$ is arbitrary, this implies the existence of a $p \in \mathcal{R}_s^*(R,\mu)$ ($p \in \mathcal{R}_e^*(R,\mu)$ if
$\mu$ is ergodic) yielding $E_p \rho_1(X_0,Y_0)$ arbitrarily close to $D(R,\mu)$. Thus for any
stationary source

$$D_s^*(R,\mu) \le D(R,\mu)$$

and for any ergodic source

$$D_e^*(R,\mu) \le D(R,\mu).$$

With (10.27)-(10.30) this completes the proof. □

The previous lemma is technical but important. It permits the construction
of a stationary and ergodic pair process having rate and distortion near those
of a finite dimensional vector described by the original source and a
finite-dimensional conditional probability.

Source Coding Theorems



11.1 Source Coding and Channel Coding 

In this chapter and the next we develop the basic coding theorems of information 
theory. As is traditional, we consider two important special cases first and then 
later form the overall result by combining these special cases. In the first case 
we assume that the channel is noiseless, but it is constrained in the sense that 
it can only pass R bits per input symbol to the receiver. Since this is usually 
insufficient for the receiver to perfectly recover the source sequence, we attempt 
to code the source so that the receiver can recover it with as little distortion as 
possible. This leads to the theory of source coding or source coding subject to 
a fidelity criterion or data compression, where the latter name reflects the fact 
that sources with infinite or very large entropy are “compressed” to fit across the 
given communication link. In the next chapter we ignore the source and focus 
on a discrete alphabet channel and construct codes that can communicate any of 
a finite number of messages with small probability of error and we quantify how 
large the message set can be. This operation is called channel coding or error 
control coding. We then develop joint source and channel codes which combine 
source coding and channel coding so as to code a given source for communication 
over a given channel so as to minimize average distortion. The ad hoc division 
into two forms of coding is convenient and will permit performance near that of 
the OPTA function for the codes considered. 

11.2 Block Source Codes for AMS Sources 

We first consider a particular class of codes: block codes. For the time being 
we also concentrate on additive distortion measures. Extensions to subadditive 
distortion measures will be considered later. Let $\{X_n\}$ be a source with a
standard alphabet $A$. Recall that an $(N,K)$ block code of a source $\{X_n\}$ maps
successive nonoverlapping input vectors $\{X_{nN}^N\}$ into successive channel vectors
$U_{nK}^K = \alpha(X_{nN}^N)$, where $\alpha : A^N \to B^K$ is called the source encoder. We assume
that the channel is noiseless, but that it is constrained in the sense that $N$ source
time units correspond to the same amount of physical time as $K$ channel time
units and that

$$\frac{K \log \|B\|}{N} \le R,$$

where the inequality can be made arbitrarily close to equality by taking $N$ and
$K$ large enough subject to the physical stationarity constraint. $R$ is called the
source coding rate or resolution in bits or nats per input symbol. We may wish
to change the values of $N$ and $K$, but the rate is fixed.

A reproduction or approximation of the original source is obtained by a
source decoder, which we also assume to be a block code. The decoder is a
mapping $\beta : B^K \to \hat{A}^N$ which forms the reproduction process $\{\hat{X}_n\}$ via
$\hat{X}_{nN}^N = \beta(U_{nK}^K)$; $n = 1, 2, \ldots$. In general we could have a reproduction dimension
different from that of the input vectors provided they corresponded to the same
amount of physical time and a suitable distortion measure was defined. We will
make the simplifying assumption that they are the same, however.

Because $N$ source symbols are mapped into $N$ reproduction symbols, we
will often refer to $N$ alone as the block length of the source code. Observe that
the resulting sequence coder is $N$-stationary. Our immediate goal is now the
following: Let $\mathcal{E}$ and $\mathcal{D}$ denote the collection of all block codes with rate no
greater than $R$ and let $\nu$ be the given channel. What is the OPTA function
$\Delta^*(\mu,\mathcal{E},\nu,\mathcal{D})$ for this system? Our first step toward evaluating the OPTA is to
find a simpler and equivalent expression for the current special case.

Given a source code consisting of encoder $\alpha$ and decoder $\beta$, define the
codebook to be

$$\mathcal{C} = \{\text{all } \beta(u^K);\ u^K \in B^K\},$$

that is, the collection of all possible reproduction vectors available to the
receiver. For convenience we can index these words as

$$\mathcal{C} = \{y_i;\ i = 1, 2, \ldots, M\},$$

where $N^{-1} \log M \le R$ by construction. Observe that if we are given only
a decoder $\beta$ or, equivalently, a codebook, and if our goal is to minimize the
average distortion for the current block, then no encoder can do better than
the encoder $\alpha^*$ which maps an input word $x^N$ into the minimum distortion
available reproduction word, that is, define $\alpha^*(x^N)$ to be the $u^K$ minimizing
$\rho_N(x^N,\beta(u^K))$, an assignment we denote by

$$\alpha^*(x^N) = \min_{u^K}{}^{-1}\, \rho_N(x^N,\beta(u^K)).$$

Observe that by construction we therefore have that

$$\rho_N(x^N,\beta(\alpha^*(x^N))) = \min_{y \in \mathcal{C}} \rho_N(x^N,y)$$

and the overall mapping of $x^N$ into a reproduction is a minimum distortion or
nearest neighbor mapping. Define

$$\rho_N(x^N,\mathcal{C}) = \min_{y \in \mathcal{C}} \rho_N(x^N,y).$$

To formally prove that this is the best encoder, observe that if the source $\mu$ is
AMS and $p$ is the joint distribution of the source and reproduction, then $p$ is also
AMS. This follows since the channel induced by the block code is $N$-stationary
and hence also AMS with respect to $T^N$. This means that $p$ is AMS with respect
to $T^N$ which in turn implies that it is AMS with respect to $T$ (Theorem 7.3.1 of
[50]). Letting $\bar{p}$ denote the stationary mean of $p$ and $\bar{p}_N$ denote the $N$-stationary
mean, we then have from (10.10) that for any block code with codebook $\mathcal{C}$

$$\Delta = \frac{1}{N} E_{\bar{p}_N} \rho_N(X^N,Y^N) \ge \frac{1}{N} E_{\bar{\mu}_N} \rho_N(X^N,\mathcal{C}),$$

with equality if the minimum distortion encoder is used. For this reason we can
confine interest to block codes specified by a codebook: the encoder produces
the index of the minimum distortion codeword for the observed vector and the
decoder is a table lookup producing the codeword being indexed. A code of this
type is also called a vector quantizer or block quantizer. Denote the performance
of the block code with codebook $\mathcal{C}$ on the source $\mu$ by

$$\rho(\mathcal{C},\mu) = \Delta = E_p \rho_\infty.$$
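
The codebook description of a block code can be illustrated with a minimal sketch. It assumes a binary alphabet and Hamming distortion, neither of which is required by the text, and the function names (`encode`, `decode`, `block_distortion`) are hypothetical labels for $\alpha^*$, $\beta$, and $\rho_N(x^N,\mathcal{C})$.

```python
from itertools import product

def hamming(x, y):
    """Additive (unnormalized) distortion rho_N: number of differing letters."""
    return sum(a != b for a, b in zip(x, y))

def encode(x, codebook, rho=hamming):
    """Minimum distortion encoder alpha*: index of a nearest codeword."""
    return min(range(len(codebook)), key=lambda i: rho(x, codebook[i]))

def decode(index, codebook):
    """Decoder beta: a table lookup."""
    return codebook[index]

def block_distortion(x, codebook, rho=hamming):
    """rho_N(x^N, C): the minimum over y in C of rho_N(x^N, y)."""
    return min(rho(x, y) for y in codebook)

# Illustrative codebook with N = 3 and M = 2 words (rate (1/3) log 2).
codebook = [(0, 0, 0), (1, 1, 1)]
for x in product((0, 1), repeat=3):
    y = decode(encode(x, codebook), codebook)
    # Encoding followed by decoding achieves the minimum over the codebook.
    assert hamming(x, y) == block_distortion(x, codebook)
```

Ties between equally good codewords are broken here by taking the lowest index; any tie-breaking rule gives the same distortion.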

Lemma 11.2.1: Given an AMS source $\mu$ and a block length $N$ codebook
$\mathcal{C}$, let $\bar{\mu}_N$ denote the $N$-stationary mean of $\mu$ (which exists from Corollary 7.3.1
of [50]), let $p$ denote the induced input/output distribution, and let $\bar{p}$ and $\bar{p}_N$
denote its stationary mean and $N$-stationary mean, respectively. Then

$$\rho(\mathcal{C},\mu) = E_{\bar{p}} \rho_1(X_0,Y_0) = \frac{1}{N} E_{\bar{p}_N} \rho_N(X^N,Y^N)$$

$$= \frac{1}{N} E_{\bar{\mu}_N} \rho_N(X^N,\mathcal{C}) = \rho(\mathcal{C},\bar{\mu}_N).$$

Proof: The first two equalities follow from (10.10), the next from the use of
the minimum distortion encoder, the last from the definition of the performance
of a block code. □

It need not be true in general that $\rho(\mathcal{C},\mu)$ equals $\rho(\mathcal{C},\bar{\mu})$. For example, if $\mu$
produces a single periodic waveform with period $N$ and $\mathcal{C}$ consists of a single
period, then $\rho(\mathcal{C},\mu) = 0$ and $\rho(\mathcal{C},\bar{\mu}) > 0$. It is the $N$-stationary mean and not
the stationary mean that is most useful for studying an $N$-stationary code.
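
The periodic example can be checked numerically. The sketch below assumes a binary waveform with period 3 and Hamming distortion (illustrative choices), and it uses the fact that the stationary mean of a deterministic sequence of period $N$ puts mass $1/N$ on each of its $N$ phases.

```python
from fractions import Fraction

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def block_performance(period, codebook, shift):
    """Per-letter distortion of the code on the periodic sequence
    started at phase `shift`."""
    N = len(period)
    block = tuple(period[(shift + k) % N] for k in range(N))
    return Fraction(min(hamming(block, y) for y in codebook), N)

period = (0, 1, 1)          # one period of the waveform (N = 3)
codebook = [period]         # the code consists of that single period
N = len(period)

perf_mu = block_performance(period, codebook, 0)
# Performance is linear in the source, so the stationary mean
# contributes the average over the N phases.
perf_stationary_mean = sum(
    block_performance(period, codebook, i) for i in range(N)
) / N
print(perf_mu, perf_stationary_mean)   # 0 and 4/9
```

The code is perfectly matched to the source when the block boundaries line up with the period, but two of the three phases of the stationary mean are misaligned and each costs 2 letters out of 3.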

We now define the OPTA for block codes to be

$$\delta(R,\mu) = \Delta^*(\mu,\nu,\mathcal{E},\mathcal{D}) = \inf_N \delta_N(R,\mu),$$

$$\delta_N(R,\mu) = \inf_{\mathcal{C} \in \mathcal{K}(N,R)} \rho(\mathcal{C},\mu),$$

where $\nu$ is the noiseless channel as described previously, $\mathcal{E}$ and $\mathcal{D}$ are classes
of block codes for the channel, and $\mathcal{K}(N,R)$ is the class of all block length $N$
codebooks $\mathcal{C}$ with

$$\frac{1}{N} \log \|\mathcal{C}\| \le R.$$

$\delta(R,\mu)$ is called the block source coding OPTA or the operational block coding
distortion-rate function.

Corollary 11.2.1: Given an AMS source $\mu$, then for any $N$ and $i = 0, 1, \ldots, N-1$

$$\delta_N(R,\mu T^{-i}) = \delta_N(R,\bar{\mu}_N T^{-i}).$$

Proof: For $i = 0$ the result is immediate from the lemma. For $i \ne 0$ it follows
from the lemma and the fact that the $N$-stationary mean of $\mu T^{-i}$ is $\bar{\mu}_N T^{-i}$ (as
is easily verified from the definitions). □



Reference Letters 

Many of the source coding results will require a technical condition that is
a generalization of the reference letter condition of Theorem 10.6.1 for stationary
sources. An AMS source $\mu$ is said to have a reference letter $a^* \in \hat{A}$ with respect
to a distortion measure $\rho = \rho_1$ on $A \times \hat{A}$ if

$$\sup_n E_{\mu T^{-n}} \rho(X_0,a^*) = \sup_n E_\mu \rho(X_n,a^*) = \rho^* < \infty, \tag{11.1}$$

that is, there exists a letter for which $E_\mu \rho(X_n,a^*)$ is uniformly bounded above.
If we define for any $k$ the vector $a^{*k} = (a^*, a^*, \ldots, a^*)$ consisting of $k$ $a^*$'s, then
(11.1) implies that

$$\sup_n E_{\mu T^{-n}} \frac{1}{k} \rho_k(X^k,a^{*k}) \le \rho^* < \infty. \tag{11.2}$$

We assume for convenience that any block code of length $N$ contains the
reference vector $a^{*N}$. This ensures that $\rho_N(x^N,\mathcal{C}) \le \rho_N(x^N,a^{*N})$ and hence
that $\rho_N(x^N,\mathcal{C})$ is bounded above by a $\mu$-integrable function and hence is itself
$\mu$-integrable. This implies that

$$\delta(R,\mu) \le \delta_N(R,\mu) \le \rho^*. \tag{11.3}$$

The reference letter also works for the stationary mean source $\bar{\mu}$ since

$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \rho(x_i,a^*) = \rho_\infty(x,\mathbf{a}^*),$$

$\mu$-a.e. and $\bar{\mu}$-a.e., where $\mathbf{a}^*$ denotes an infinite sequence of $a^*$. Since $\rho_\infty$ is
invariant we have from Lemma 6.3.1 of [50] and Fatou's lemma that

$$E_{\bar{\mu}} \rho(X_0,a^*) = E_\mu \left( \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} \rho(X_i,a^*) \right) \le \liminf_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} E_\mu \rho(X_i,a^*) \le \rho^*.$$

Performance and OPTA

We next develop several basic properties of the performance and OPTA func- 
tions for block coding AMS sources with additive fidelity criteria. 

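Lemma 11.2.2: Given two sources $\mu_1$ and $\mu_2$ and $\lambda \in (0,1)$, then for any
block code $\mathcal{C}$

$$\rho(\mathcal{C},\lambda\mu_1 + (1-\lambda)\mu_2) = \lambda\rho(\mathcal{C},\mu_1) + (1-\lambda)\rho(\mathcal{C},\mu_2)$$

and for any $N$

$$\delta_N(R,\lambda\mu_1 + (1-\lambda)\mu_2) \ge \lambda\delta_N(R,\mu_1) + (1-\lambda)\delta_N(R,\mu_2)$$

and

$$\delta(R,\lambda\mu_1 + (1-\lambda)\mu_2) \ge \lambda\delta(R,\mu_1) + (1-\lambda)\delta(R,\mu_2).$$

Thus performance is linear in the source and the OPTA functions are convex
$\cap$. Lastly,

$$\delta_N(R + \tfrac{1}{N},\lambda\mu_1 + (1-\lambda)\mu_2) \le \lambda\delta_N(R,\mu_1) + (1-\lambda)\delta_N(R,\mu_2).$$

Proof: The equality follows from the linearity of expectation since $\rho(\mathcal{C},\mu) =
E_\mu \rho(X^N,\mathcal{C})$. The first inequality follows from the equality and the fact that
the infimum of a sum is bounded below by the sum of the infima. The next
inequality follows similarly. To get the final inequality, let $\mathcal{C}_i$ approximately
yield $\delta_N(R,\mu_i)$; that is,

$$\rho(\mathcal{C}_i,\mu_i) \le \delta_N(R,\mu_i) + \epsilon.$$

Form the union code $\mathcal{C} = \mathcal{C}_1 \cup \mathcal{C}_2$ containing all of the words in both of the
codes. Then the rate of the code is

$$\frac{1}{N} \log \|\mathcal{C}\| = \frac{1}{N} \log(\|\mathcal{C}_1\| + \|\mathcal{C}_2\|) \le \frac{1}{N} \log(2^{NR} + 2^{NR}) = R + \frac{1}{N}.$$

This code yields performance

$$\rho(\mathcal{C},\lambda\mu_1 + (1-\lambda)\mu_2) = \lambda\rho(\mathcal{C},\mu_1) + (1-\lambda)\rho(\mathcal{C},\mu_2)$$

$$\le \lambda\rho(\mathcal{C}_1,\mu_1) + (1-\lambda)\rho(\mathcal{C}_2,\mu_2) \le \lambda\delta_N(R,\mu_1) + \lambda\epsilon + (1-\lambda)\delta_N(R,\mu_2) + (1-\lambda)\epsilon.$$

Since the leftmost term in the above equation can be no smaller than $\delta_N(R +
1/N,\lambda\mu_1 + (1-\lambda)\mu_2)$, the lemma is proved. □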
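
The union code argument can be checked on a toy case. The sketch below assumes two i.i.d. binary sources, Hamming distortion, block length 4, and small hand-picked codebooks; none of these choices come from the text, and `performance` is a hypothetical name for $\rho(\mathcal{C},\mu)$ computed by exhaustive enumeration.

```python
import math
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def iid_block_pmf(p1, N):
    """pmf of a block of N i.i.d. Bernoulli(p1) letters."""
    return {x: math.prod(p1 if b else 1 - p1 for b in x)
            for x in product((0, 1), repeat=N)}

def performance(codebook, pmf):
    """rho(C, mu) = (1/N) E min_{y in C} rho_N(X^N, y)."""
    N = len(codebook[0])
    return sum(p * min(hamming(x, y) for y in codebook)
               for x, p in pmf.items()) / N

N, lam = 4, 0.3
mu1, mu2 = iid_block_pmf(0.1, N), iid_block_pmf(0.9, N)
mix = {x: lam * mu1[x] + (1 - lam) * mu2[x] for x in mu1}

C1 = [(0, 0, 0, 0), (0, 0, 0, 1)]   # words suited to mu1
C2 = [(1, 1, 1, 1), (1, 1, 1, 0)]   # words suited to mu2
union = C1 + C2

R = math.log2(len(C1)) / N                 # rate of each component code
rate_union = math.log2(len(union)) / N     # at most R + 1/N

lhs = performance(union, mix)
linear = lam * performance(union, mu1) + (1 - lam) * performance(union, mu2)
bound = lam * performance(C1, mu1) + (1 - lam) * performance(C2, mu2)
```

Here `lhs` and `linear` agree (linearity in the source), `lhs` does not exceed `bound` (the union contains both codes), and the price is a rate increase of at most $1/N$ bit.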

The first and last inequalities in the lemma suggest that $\delta_N$ is very nearly
an affine function of the source and hence perhaps $\delta$ is as well. We will later
pursue this possibility, but we are not yet equipped to do so.

Before developing the connection between the OPTA functions of AMS
sources and those of their stationary mean, we pause to develop some
additional properties for OPTA in the special case of stationary sources. These
results follow Kieffer [76].

Lemma 11.2.3: Suppose that $\mu$ is a stationary source. Then

$$\delta(R,\mu) = \lim_{N\to\infty} \delta_N(R,\mu).$$

Thus the infimum over block lengths is given by the limit so that longer codes
can do better.

Proof: Fix an $N$ and an $n < N$ and choose codes $\mathcal{C}_n \subset \hat{A}^n$ and $\mathcal{C}_{N-n} \subset \hat{A}^{N-n}$
for which

$$\rho(\mathcal{C}_n,\mu) \le \delta_n(R,\mu) + \frac{\epsilon}{2n},$$

$$\rho(\mathcal{C}_{N-n},\mu) \le \delta_{N-n}(R,\mu) + \frac{\epsilon}{2(N-n)}.$$

Form the block length $N$ code $\mathcal{C} = \mathcal{C}_n \times \mathcal{C}_{N-n}$. This code has rate no greater
than $R$ and has distortion

$$N\rho(\mathcal{C},\mu) = E \min_{y \in \mathcal{C}} \rho_N(X^N,y)$$

$$= E \min_{y^n \in \mathcal{C}_n} \rho_n(X^n,y^n) + E \min_{v^{N-n} \in \mathcal{C}_{N-n}} \rho_{N-n}(X_n^{N-n},v^{N-n})$$

$$= E \min_{y^n \in \mathcal{C}_n} \rho_n(X^n,y^n) + E \min_{v^{N-n} \in \mathcal{C}_{N-n}} \rho_{N-n}(X^{N-n},v^{N-n})$$

$$= n\rho(\mathcal{C}_n,\mu) + (N-n)\rho(\mathcal{C}_{N-n},\mu)$$

$$\le n\delta_n(R,\mu) + (N-n)\delta_{N-n}(R,\mu) + \epsilon, \tag{11.4}$$

where we have made essential use of the stationarity of the source. Since $\epsilon$ is
arbitrary and since the leftmost term in the above equation can be no smaller
than $N\delta_N(R,\mu)$, we have shown that

$$N\delta_N(R,\mu) \le n\delta_n(R,\mu) + (N-n)\delta_{N-n}(R,\mu)$$

and hence that the sequence $N\delta_N$ is subadditive. The result then follows
immediately from Lemma 7.5.1 of [50]. □
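
The additivity step behind (11.4) can be verified exactly on a small stationary source. The sketch below assumes a two-state Markov chain started in its stationary distribution and Hamming distortion; the transition matrix and codebooks are illustrative assumptions.

```python
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

P = [[0.9, 0.1], [0.3, 0.7]]   # transition matrix of the chain
pi = [0.75, 0.25]              # stationary distribution (pi P = pi)

def block_prob(x):
    """Probability of the block x under the stationary chain."""
    p = pi[x[0]]
    for a, b in zip(x, x[1:]):
        p *= P[a][b]
    return p

def total_distortion(codebook):
    """E min_{y in C} rho_n(X^n, y), which equals n * rho(C, mu)."""
    n = len(codebook[0])
    return sum(block_prob(x) * min(hamming(x, y) for y in codebook)
               for x in product((0, 1), repeat=n))

C_n = [(0, 0), (1, 1)]          # block length n = 2
C_rest = [(0,)]                 # block length N - n = 1
C = [y + v for y in C_n for v in C_rest]   # product code, N = 3

lhs = total_distortion(C)
rhs = total_distortion(C_n) + total_distortion(C_rest)
```

The minimum over the product code splits into a sum of minima over the factors, and stationarity lets the second factor be evaluated as if it started at time 0, so `lhs` and `rhs` agree up to rounding.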

Corollary 11.2.2: If $\mu$ is a stationary source, then $\delta(R,\mu)$ is a convex $\cup$
function of $R$ and hence is continuous for $R > 0$.

Proof: Pick $R_1 > R_2$ and $\lambda \in (0,1)$. Define $R = \lambda R_1 + (1-\lambda)R_2$. For large
$n$ define $n_1 = \lfloor \lambda n \rfloor$ to be the largest integer not exceeding $\lambda n$ and let $n_2 = n - n_1$. Pick
codebooks $\mathcal{C}_i \subset \hat{A}^{n_i}$ with rate $R_i$ with distortion

$$\rho(\mathcal{C}_i,\mu) \le \delta_{n_i}(R_i,\mu) + \epsilon.$$

Analogous to (11.4), for the product code $\mathcal{C} = \mathcal{C}_1 \times \mathcal{C}_2$ we have

$$n\rho(\mathcal{C},\mu) = n_1\rho(\mathcal{C}_1,\mu) + n_2\rho(\mathcal{C}_2,\mu)$$

$$\le n_1\delta_{n_1}(R_1,\mu) + n_2\delta_{n_2}(R_2,\mu) + n\epsilon.$$

The rate of the product code is no greater than $R$ and hence the leftmost term
above is bounded below by $n\delta_n(R,\mu)$. Dividing by $n$ we have since $\epsilon$ is arbitrary
that

$$\delta_n(R,\mu) \le \frac{n_1}{n}\delta_{n_1}(R_1,\mu) + \frac{n_2}{n}\delta_{n_2}(R_2,\mu).$$

Taking $n \to \infty$ we have using the lemma and the choice of $n_i$ that

$$\delta(R,\mu) \le \lambda\delta(R_1,\mu) + (1-\lambda)\delta(R_2,\mu),$$

proving the claimed convexity. □
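
The claim that the product code has rate no greater than $R$ can be written out as a short worked step. It uses only $n_1 \le \lambda n$, $n_2 = n - n_1$, and $R_1 > R_2$ from the proof above:

```latex
\begin{align*}
\frac{1}{n}\log\|\mathcal{C}_1 \times \mathcal{C}_2\|
  &\le \frac{n_1 R_1 + n_2 R_2}{n}
   = R_2 + \frac{n_1}{n}\,(R_1 - R_2) \\
  &\le R_2 + \lambda\,(R_1 - R_2)
   = \lambda R_1 + (1-\lambda) R_2 = R.
\end{align*}
```

Since $n_1/n \to \lambda$ and $n_2/n \to 1-\lambda$ as $n \to \infty$, the weights in the bound on $\delta_n(R,\mu)$ converge to the convex combination in the statement.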

Corollary 11.2.3: If $\mu$ is stationary, then $\delta(R,\mu)$ is an affine function of $\mu$.

Proof: From Lemma 11.2.2 we need only prove that

$$\delta(R,\lambda\mu_1 + (1-\lambda)\mu_2) \le \lambda\delta(R,\mu_1) + (1-\lambda)\delta(R,\mu_2).$$

From the same lemma we have that for any $N$

$$\delta_N(R + \tfrac{1}{N},\lambda\mu_1 + (1-\lambda)\mu_2) \le \lambda\delta_N(R,\mu_1) + (1-\lambda)\delta_N(R,\mu_2).$$

For any $K \le N$ we have since $\delta_N(R,\mu)$ is nonincreasing in $R$ that

$$\delta_N(R + \tfrac{1}{K},\lambda\mu_1 + (1-\lambda)\mu_2) \le \lambda\delta_N(R,\mu_1) + (1-\lambda)\delta_N(R,\mu_2).$$

Taking the limit as $N \to \infty$ yields from Lemma 11.2.3 that

$$\delta(R + \tfrac{1}{K},\lambda\mu_1 + (1-\lambda)\mu_2) \le \lambda\delta(R,\mu_1) + (1-\lambda)\delta(R,\mu_2).$$

From Corollary 11.2.2, however, $\delta$ is continuous in $R$ and the result follows by
letting $K \to \infty$. □

The following lemma provides the principal tool necessary for relating the 
OPTA of an AMS source with that of its stationary mean. It shows that the 
OPTA of an AMS source is not changed by shifting or, equivalently, by redefining 
the time origin. 

Lemma 11.2.4: Let $\mu$ be an AMS source with a reference letter. Then for
any integer $i$, $\delta(R,\mu) = \delta(R,\mu T^{-i})$.

Proof: Fix $\epsilon > 0$ and let $\mathcal{C}_N$ be a rate $R$ block length $N$ codebook for which
$\rho(\mathcal{C}_N,\mu) \le \delta(R,\mu) + \epsilon/2$. For $1 \le i \le N-1$ choose $J$ large and define the block
length $K = JN$ code $\mathcal{C}_K(i)$ by

$$\mathcal{C}_K(i) = a^{*(N-i)} \times \left( \times_{j=0}^{J-2} \mathcal{C}_N \right) \times a^{*i},$$

where $a^{*l}$ is an $l$-tuple containing all $a^*$'s. $\mathcal{C}_K(i)$ can be considered to be a code
consisting of the original code shifted by $i$ time units and repeated many times,
with some filler at the beginning and end. Except for the edges of the long
product code, the effect on the source is to use the original code with a delay.
The code has at most $(2^{NR})^{J-1} = 2^{KR} 2^{-NR}$ words; the rate is no greater than
$R$.
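
The construction of the shifted code can be sketched concretely. The helper name `shifted_code`, the binary alphabet, and the use of `0` as the reference letter are assumptions made for illustration only.

```python
import math
from itertools import product

def shifted_code(C_N, i, J, ref=0):
    """Build C_K(i): N - i reference letters, then J - 1 concatenated
    words of C_N, then i reference letters (block length K = J * N)."""
    N = len(C_N[0])
    head, tail = (ref,) * (N - i), (ref,) * i
    return [head + sum(words, ()) + tail
            for words in product(C_N, repeat=J - 1)]

C_N = [(0, 0, 0), (1, 1, 1)]    # N = 3, rate (1/3) log2 2
J, i = 3, 1
C_K = shifted_code(C_N, i, J)
K = J * len(C_N[0])

rate_N = math.log2(len(C_N)) / len(C_N[0])
rate_K = math.log2(len(C_K)) / K
print(len(C_K), K, rate_K <= rate_N)
```

The long code has $\|\mathcal{C}_N\|^{J-1}$ words of length $K$, so its rate is $(J-1)/J$ times that of $\mathcal{C}_N$, in agreement with the count $2^{KR}2^{-NR}$ above.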

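For any $K$-block $x^K$ the distortion resulting from using $\mathcal{C}_K(i)$ is given by

$$K\rho_K(x^K,\mathcal{C}_K(i)) \le (N-i)\rho_{N-i}(x^{N-i},a^{*(N-i)}) + \sum_{j=0}^{J-2} N\rho_N(x_{N-i+jN}^N,\mathcal{C}_N) + i\rho_i(x_{K-i}^i,a^{*i}). \tag{11.5}$$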



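Let $\{\hat{x}_n\}$ denote the encoded process using the block code $\mathcal{C}_K(i)$. If $n$ is a
multiple of $K$, then

$$n\rho_n(x^n,\hat{x}^n) \le \sum_{k=0}^{\lfloor n/K \rfloor} \left( (N-i)\rho_{N-i}(x_{kK}^{N-i},a^{*(N-i)}) + i\rho_i(x_{(k+1)K-i}^i,a^{*i}) \right) + \sum_{k=0}^{\lfloor n/K \rfloor J - 1} N\rho_N(x_{N-i+kN}^N,\mathcal{C}_N).$$

If $n$ is not a multiple of $K$ we can further overbound the distortion by including
the distortion contributed by enough future symbols to complete a $K$-block,
that is,

$$n\rho_n(x^n,\hat{x}^n) \le n\gamma_n(x,\hat{x})$$

$$= \sum_{k=0}^{\lfloor n/K \rfloor + 1} \left( (N-i)\rho_{N-i}(x_{kK}^{N-i},a^{*(N-i)}) + i\rho_i(x_{(k+1)K-i}^i,a^{*i}) \right) + \sum_{k=0}^{(\lfloor n/K \rfloor + 1)J - 1} N\rho_N(x_{N-i+kN}^N,\mathcal{C}_N).$$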



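Thus

$$\rho_n(x^n,\hat{x}^n) \le \frac{N-i}{K}\, \frac{1}{n/K} \sum_{k=0}^{\lfloor n/K \rfloor + 1} \rho_{N-i}(X^{N-i}(T^{kK}x),a^{*(N-i)})$$

$$+ \frac{i}{K}\, \frac{1}{n/K} \sum_{k=0}^{\lfloor n/K \rfloor + 1} \rho_i(X^i(T^{(k+1)K-i}x),a^{*i})$$

$$+ \frac{1}{n/N} \sum_{k=0}^{(\lfloor n/K \rfloor + 1)J - 1} \rho_N(X^N(T^{(N-i)+kN}x),\mathcal{C}_N).$$

Since $\mu$ is AMS these quantities all converge to invariant functions:

$$\lim_{n\to\infty} \rho_n(x^n,\hat{x}^n) \le \frac{N-i}{K} \lim_{m\to\infty} \frac{1}{m} \sum_{k=0}^{m-1} \rho_{N-i}(X^{N-i}(T^{kK}x),a^{*(N-i)})$$

$$+ \frac{i}{K} \lim_{m\to\infty} \frac{1}{m} \sum_{k=0}^{m-1} \rho_i(X^i(T^{(k+1)K-i}x),a^{*i})$$

$$+ \lim_{m\to\infty} \frac{1}{m} \sum_{k=0}^{m-1} \rho_N(X^N(T^{(N-i)+kN}x),\mathcal{C}_N).$$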

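We now apply Fatou's lemma, a change of variables, and Lemma 11.2.1 to
obtain

$$\delta(R,\mu T^{-i}) \le \rho(\mathcal{C}_K(i),\mu T^{-i})$$

$$\le \frac{N-i}{K} \limsup_{m\to\infty} \frac{1}{m} \sum_{k=0}^{m-1} E_{\mu T^{-i}} \rho_{N-i}(X^{N-i}T^{kK},a^{*(N-i)})$$

$$+ \frac{i}{K} \lim_{m\to\infty} \frac{1}{m} \sum_{k=0}^{m-1} E_{\mu T^{-i}} \rho_i(X^i T^{(k+1)K-i},a^{*i})$$

$$+ E_{\mu T^{-i}} \lim_{m\to\infty} \frac{1}{m} \sum_{k=0}^{m-1} \rho_N(X^N T^{(N-i)+kN},\mathcal{C}_N)$$

$$\le \frac{N-i}{K}\rho^* + \frac{i}{K}\rho^* + E_\mu \lim_{m\to\infty} \frac{1}{m} \sum_{k=1}^{m} \rho_N(X^N T^{kN},\mathcal{C}_N) \le \frac{N}{K}\rho^* + \rho(\mathcal{C}_N,\mu).$$

Thus if $J$ and hence $K$ are chosen large enough to ensure that $(N/K)\rho^* \le \epsilon/2$, then

$$\delta(R,\mu T^{-i}) \le \delta(R,\mu) + \epsilon,$$

which proves that $\delta(R,\mu T^{-i}) \le \delta(R,\mu)$. The reverse inequality is found in
a similar manner: Let $\mathcal{C}_N$ be a codebook for $\mu T^{-i}$ and construct a codebook
$\mathcal{C}_K(N-i)$ for use on $\mu$. By arguments nearly identical to those above the reverse
inequality is found and the proof completed. □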

Corollary 11.2.4: Let $\mu$ be an AMS source with a reference letter. Fix
$N$ and let $\bar{\mu}$ and $\bar{\mu}_N$ denote the stationary and $N$-stationary means. Then for
$R > 0$

$$\delta(R,\bar{\mu}) = \delta(R,\bar{\mu}_N T^{-i});\quad i = 0, 1, \ldots, N-1.$$

Proof: It follows from the previous lemma that the $\delta(R,\bar{\mu}_N T^{-i})$ are all equal
and hence it follows from Lemma 11.2.2, Theorem 7.3.1 of [50], and Corollary
7.3.1 of [50] that

$$\delta(R,\bar{\mu}) \ge \frac{1}{N} \sum_{i=0}^{N-1} \delta(R,\bar{\mu}_N T^{-i}) = \delta(R,\bar{\mu}_N).$$

To prove the reverse inequality, take $\mu = \bar{\mu}_N$ in the previous lemma and
construct the codes $\mathcal{C}_K(i)$ as in the previous proof. Take the union code
$\mathcal{C}_K = \bigcup_{i=0}^{N-1} \mathcal{C}_K(i)$ having block length $K$ and rate at most $R + K^{-1}\log N$.
We have from Lemma 11.2.1 and (11.5) that

$$\rho(\mathcal{C}_K,\bar{\mu}) = \frac{1}{N} \sum_{i=0}^{N-1} \rho(\mathcal{C}_K,\bar{\mu}_N T^{-i})$$

$$\le \frac{1}{N} \sum_{i=0}^{N-1} \rho(\mathcal{C}_K(i),\bar{\mu}_N T^{-i}) \le \frac{N}{K}\rho^* + \rho(\mathcal{C}_N,\bar{\mu}_N)$$

and hence as before

$$\delta(R + \tfrac{1}{JN}\log N,\bar{\mu}) \le \delta(R,\bar{\mu}_N).$$

From Corollary 11.2.2 $\delta(R,\bar{\mu})$ is continuous in $R$ for $R > 0$ since $\bar{\mu}$ is stationary.
Hence taking $J$ large enough yields $\delta(R,\bar{\mu}) \le \delta(R,\bar{\mu}_N)$. This completes the
proof since from the lemma $\delta(R,\bar{\mu}_N T^{-i}) = \delta(R,\bar{\mu}_N)$. □

We are now prepared to demonstrate the fundamental fact that the block 
source coding OPTA function for an AMS source with an additive fidelity cri- 
terion is the same as that of the stationary mean process. This will allow us to 
assume stationarity when proving the actual coding theorems. 

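Theorem 11.2.2: If $\mu$ is an AMS source and $\{\rho_n\}$ an additive fidelity
criterion with a reference letter, then for $R > 0$

$$\delta(R,\mu) = \delta(R,\bar{\mu}).$$

Proof: We have from Corollaries 11.2.1 and 11.2.4 that

$$\delta(R,\bar{\mu}) \le \delta(R,\bar{\mu}_N) \le \delta_N(R,\bar{\mu}_N) = \delta_N(R,\mu).$$

Taking the infimum over $N$ yields

$$\delta(R,\bar{\mu}) \le \delta(R,\mu).$$

Conversely, fix $\epsilon > 0$ and let $\mathcal{C}_N$ be a block length $N$ codebook for which $\rho(\mathcal{C}_N,\bar{\mu})
\le \delta(R,\bar{\mu}) + \epsilon$. From Lemma 11.2.1, Corollary 11.2.1, and Lemma 11.2.4

$$\delta(R,\bar{\mu}) + \epsilon \ge \rho(\mathcal{C}_N,\bar{\mu}) = \frac{1}{N} \sum_{i=0}^{N-1} \rho(\mathcal{C}_N,\bar{\mu}_N T^{-i})$$

$$\ge \frac{1}{N} \sum_{i=0}^{N-1} \delta_N(R,\bar{\mu}_N T^{-i}) = \frac{1}{N} \sum_{i=0}^{N-1} \delta_N(R,\mu T^{-i})$$

$$\ge \frac{1}{N} \sum_{i=0}^{N-1} \delta(R,\mu T^{-i}) = \delta(R,\mu),$$

which completes the proof since $\epsilon$ is arbitrary. □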

Since the OPTA functions are the same for an AMS process and its
stationary mean, this immediately yields the following corollary from Corollary
11.2.2:

Corollary 11.2.5: If $\mu$ is AMS, then $\delta(R,\mu)$ is a convex function of $R$ and
hence a continuous function of $R$ for $R > 0$.

11.3 Block Coding Stationary Sources

We showed in the previous section that when proving block source coding the- 
orems for AMS sources, we could confine interest to stationary sources. In this 
section we show that in an important special case we can further confine inter- 
est to only those stationary sources that are ergodic by applying the ergodic 
decomposition. This will permit us to assume that sources are stationary and 
ergodic in the next section when the basic Shannon source coding theorem is 
proved and then extend the result to AMS sources which may not be ergodic. 

As previously we assume that we have a stationary source $\{X_n\}$ with
distribution $\mu$ and we assume that $\{\rho_n\}$ is an additive distortion measure and there
exists a reference letter. For this section we now assume in addition that the
alphabet $A$ is itself a Polish space and that $\rho_1(x,y)$ is a continuous function of
$x$ for every $y \in \hat{A}$. If the underlying alphabet has a metric structure, then it
is reasonable to assume that forcing input symbols to be very close in the
underlying alphabet should force the distortion between either symbol and a fixed
output to be close also. The following theorem is the ergodic decomposition of
the block source coding OPTA function.

Theorem 11.3.1: Suppose that $\mu$ is the distribution of a stationary source
and that $\{\rho_n\}$ is an additive fidelity criterion with a reference letter. Assume
also that $\rho_1(\cdot,y)$ is a continuous function for all $y$. Let $\{\mu_x\}$ denote the ergodic
decomposition of $\mu$. Then

$$\delta(R,\mu) = \int d\mu(x)\, \delta(R,\mu_x),$$

that is, $\delta(R,\mu)$ is the average of the OPTA of its ergodic components.

Proof: Analogous to the ergodic decomposition of entropy rate of Theorem
2.4.1, we need to show that $\delta(R,\mu)$ satisfies the conditions of Theorem 8.9.1 of
[50]. We have already seen (Corollary 11.2.3) that it is an affine function. We
next see that it is upper semicontinuous. Since the alphabet is Polish, choose a
distance $d_{\mathcal{G}}$ on the space of stationary processes having this alphabet with the
property that $\mathcal{G}$ is constructed as in Section 8.2 of [50]. Pick an $N$ large enough
and a length $N$ codebook $\mathcal{C}$ so that

$$\delta(R,\mu) \ge \delta_N(R,\mu) - \frac{\epsilon}{2} \ge \rho_N(\mathcal{C},\mu) - \epsilon.$$

$\rho_N(x^N,y)$ is by assumption a continuous function of $x^N$ and hence so is $\rho_N(x^N,\mathcal{C}) =
\min_{y \in \mathcal{C}} \rho_N(x^N,y)$. Since it is also nonnegative, we have from Lemma 8.2.4 of [50]
that if $\mu_n \to \mu$ then

$$\limsup_{n\to\infty} E_{\mu_n} \rho_N(X^N,\mathcal{C}) \le E_\mu \rho_N(X^N,\mathcal{C}).$$

The left hand side above is bounded below by

$$\limsup_{n\to\infty} \delta_N(R,\mu_n) \ge \limsup_{n\to\infty} \delta(R,\mu_n).$$

Thus since $\epsilon$ is arbitrary,

$$\limsup_{n\to\infty} \delta(R,\mu_n) \le \delta(R,\mu)$$

and hence $\delta(R,\mu)$ is upper semicontinuous in $\mu$ and hence also measurable. Since
the process has a reference letter, $\delta(R,\mu_x)$ is integrable since

$$\delta(R,\mu_x) \le \delta_N(R,\mu_x) \le E_{\mu_x} \rho_1(X_0,a^*)$$

which is integrable if $\rho_1(x_0,a^*)$ is from the ergodic decomposition theorem.
Thus Theorem 8.9.1 of [50] yields the desired result. □

The theorem was first proved by Kieffer [76] for bounded continuous additive
distortion measures. The above extension removes the requirement that $\rho_1$ be
bounded.



11.4 Block Coding AMS Ergodic Sources 

We have seen that the block source coding OPTA of an AMS source is given by 
that of its stationary mean. Hence we will be able to concentrate on stationary 
sources when proving the coding theorem. 

Theorem 11.4.1: Let $\mu$ be an AMS ergodic source with a standard alphabet
and $\{\rho_n\}$ an additive distortion measure with a reference letter. Then

$$\delta(R,\mu) = D(R,\bar{\mu}),$$

where $\bar{\mu}$ is the stationary mean of $\mu$.

Proof: From Theorem 11.2.2 $\delta(R,\mu) = \delta(R,\bar{\mu})$ and hence we will be done if
we can prove that

$$\delta(R,\bar{\mu}) = D(R,\bar{\mu}).$$

This will follow if we can show that $\delta(R,\mu) = D(R,\mu)$ for any stationary ergodic
source with a reference letter. Henceforth we assume that $\mu$ is stationary and
ergodic.

We first prove the negative or converse half of the theorem. First suppose
that we have a codebook $\mathcal{C}$ such that

$$\rho_N(\mathcal{C},\mu) = E_\mu \min_{y \in \mathcal{C}} \rho_N(X^N,y) = \delta_N(R,\mu) + \epsilon.$$

If we let $\hat{X}^N$ denote the resulting reproduction random vector and let $p^N$ denote
the resulting joint distribution of the input/output pair, then since $\hat{X}^N$ has a
finite alphabet, Lemma 5.5.6 implies that

$$I(X^N;\hat{X}^N) \le H(\hat{X}^N) \le NR$$

and hence $p^N \in \mathcal{R}_N(R,\mu^N)$ and hence

$$\delta_N(R,\mu) + \epsilon \ge E_{p^N} \rho_N(X^N;\hat{X}^N) \ge D_N(R,\mu).$$

Taking the limits as $N \to \infty$ proves the easy half of the theorem:

$$\delta(R,\mu) \ge D(R,\mu).$$

(Recall that both OPTA and distortion rate functions are given by limits if the
source is stationary.)
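
The information bound used in the converse can be checked on a small example. The sketch below assumes an i.i.d. binary source, Hamming distortion, and a two-word codebook (all illustrative). Because the reproduction is a deterministic function of the input, $H(\hat{X}^N|X^N) = 0$ and so $I(X^N;\hat{X}^N) = H(\hat{X}^N)$, which cannot exceed the logarithm of the codebook size.

```python
import math
from collections import defaultdict
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def reproduction_entropy(p1, codebook):
    """H(Xhat^N) in bits when N i.i.d. Bernoulli(p1) letters are mapped
    to a nearest codeword; this equals I(X^N; Xhat^N) because the
    encoder is deterministic."""
    N = len(codebook[0])
    out = defaultdict(float)
    for x in product((0, 1), repeat=N):
        px = math.prod(p1 if b else 1 - p1 for b in x)
        out[min(codebook, key=lambda c: hamming(x, c))] += px
    return -sum(q * math.log2(q) for q in out.values() if q > 0)

codebook = [(0, 0, 0), (1, 1, 1)]
N, M = len(codebook[0]), len(codebook)
info = reproduction_entropy(0.2, codebook)
NR = math.log2(M)       # N * R with R = (1/N) log2 M
print(info, NR)
```

For a skewed source the codewords are used with unequal probabilities, so the information conveyed is strictly below $NR$ in this example.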

The fundamental idea of Shannon's positive source coding theorem is this:
for a fixed block size $N$, choose a code at random according to a distribution
implied by the distortion-rate function. That is, perform $2^{NR}$ independent
random selections of blocks of length $N$ to form a codebook. This codebook is then
used to encode the source using a minimum distortion mapping as above. We
compute the average distortion over this double-random experiment (random
codebook selection followed by use of the chosen code to encode the random
source). We will find that if the code generation distribution is properly chosen,
then this average will be no greater than $D(R,\mu) + \epsilon$. If the average over all
randomly selected codes is no greater than $D(R,\mu) + \epsilon$, however, then there
must be at least one code such that the average distortion over the source
distribution for that one code is no greater than $D(R,\mu) + \epsilon$. This means that
there exists at least one code with performance not much larger than $D(R,\mu)$.
Unfortunately the proof only demonstrates the existence of such codes; it does
not show how to construct them.
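
The random coding idea can be tried out on a toy case. The sketch below assumes a fair-coin binary source with Hamming distortion, for which the distortion-rate function solves $1 - h(D) = R$ (with $R$ in bits) and the natural code generation distribution is uniform. The block length, the number of trials, and the function names are illustrative assumptions, and the fixed reference word used in the formal proof is omitted.

```python
import math
import random

def popcount(v):
    return bin(v).count("1")

def average_distortion(codebook, N):
    """Exact per-letter Hamming distortion of a codebook on a fair-coin
    source; blocks are represented as N-bit integers."""
    total = sum(min(popcount(x ^ y) for y in codebook)
                for x in range(2 ** N))
    return total / (N * 2 ** N)

def binary_distortion_rate(R):
    """Solve 1 - h(D) = R for D in [0, 1/2] by bisection."""
    def h(d):
        if d in (0.0, 1.0):
            return 0.0
        return -d * math.log2(d) - (1 - d) * math.log2(1 - d)
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - h(mid) > R:
            lo = mid        # rate still above R: D can be larger
        else:
            hi = mid
    return (lo + hi) / 2

random.seed(0)
N, R = 10, 0.5
M = 2 ** int(N * R)         # 2^{NR} codewords
trials = [average_distortion([random.getrandbits(N) for _ in range(M)], N)
          for _ in range(5)]
ensemble_average = sum(trials) / len(trials)
best = min(trials)
```

At least one of the sampled codebooks performs no worse than the ensemble average, which is the existence argument in miniature. At a block length this short the distortions remain well above $D(R)$; the theorem concerns the limit of large $N$.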

To find the distribution for generating the random codes we use the
ergodic process definition of the distortion-rate function. From Theorem 10.6.1
(or Lemma 10.6.3) we can select a stationary and ergodic pair process with
distribution $p$ which has the source distribution $\mu$ as one coordinate and which
has

$$E_p \rho(X_0,Y_0) = \frac{1}{N} E_{p^N} \rho_N(X^N,Y^N) \le D(R,\mu) + \epsilon \tag{11.6}$$

and which has

$$\bar{I}_p(X;Y) = I^*(X;Y) \le R \tag{11.7}$$

(and hence information densities converge in $L^1$ from Theorem 6.3.1). Denote
the implied vector distributions for $(X^N,Y^N)$, $X^N$, and $Y^N$ by $p^N$, $\mu^N$, and
$\eta^N$, respectively.

For any $N$ we can generate a codebook $\mathcal{C}$ at random according to $\eta^N$ as
described above. To be precise, consider the random codebook as a large random
vector $\mathcal{C} = (W_0, W_1, \ldots, W_M)$, where $M = \lfloor e^{N(R+\epsilon)} \rfloor$ (where natural logarithms
are used in the definition of $R$), where $W_0$ is the fixed reference vector $a^{*N}$ and
where the remaining $W_n$ are independent, and where the marginal distributions
for the $W_n$ are given by $\eta^N$. Thus the distribution for the randomly selected
code can be expressed as

$$P_{\mathcal{C}} = \times_{i=1}^{M} \eta^N.$$

This codebook is then used with the optimal encoder and we denote the resulting
average distortion (over codebook generation and the source) by

$$\Delta_N = E\rho(\mathcal{C},\mu) = \int dP_{\mathcal{C}}(W)\, \rho(W,\mu) \tag{11.8}$$

where

$$\rho(W,\mu) = \frac{1}{N} E\rho_N(X^N,W) = \frac{1}{N} \int d\mu^N(x^N)\, \rho_N(x^N,W),$$

and where

$$\rho_N(x^N,\mathcal{C}) = \min_{y \in \mathcal{C}} \rho_N(x^N,y).$$

Choose $\delta > 0$ and break up the integral over $x$ into two pieces: one over a
set $G_N = \{x : N^{-1}\rho_N(x^N,a^{*N}) \le \rho^* + \delta\}$ and the other over the complement
of this set. Then

$$\Delta_N \le \int_{G_N^c} \frac{1}{N} \rho_N(x^N,a^{*N})\, d\mu^N(x^N) + \frac{1}{N} \int dP_{\mathcal{C}}(W) \int_{G_N} d\mu^N(x^N)\, \rho_N(x^N,W), \tag{11.9}$$

where we have used the fact that $\rho_N(x^N,W) \le \rho_N(x^N,a^{*N})$. Fubini's
theorem implies that because

$$\int d\mu^N(x^N)\, \rho_N(x^N,a^{*N}) < \infty$$

and

$$\rho_N(x^N,W) \le \rho_N(x^N,a^{*N}),$$

the limits of integration in the second integral of (11.9) can be interchanged to
obtain the bound

$$\Delta_N \le \frac{1}{N} \int_{G_N^c} \rho_N(x^N,a^{*N})\, d\mu^N(x^N) + \frac{1}{N} \int_{G_N} d\mu^N(x^N) \int dP_{\mathcal{C}}(W)\, \rho_N(x^N,W). \tag{11.10}$$

The rightmost term in (11.10) can be bounded above by observing that

$$\frac{1}{N} \int_{G_N} d\mu^N(x^N) \left[ \int dP_{\mathcal{C}}(W)\, \rho_N(x^N,W) \right]$$

$$= \frac{1}{N} \int_{G_N} d\mu^N(x^N) \left[ \int_{W:\rho_N(x^N,W) \le N(D+\delta)} dP_{\mathcal{C}}(W)\, \rho_N(x^N,W) + \int_{W:\rho_N(x^N,W) > N(D+\delta)} dP_{\mathcal{C}}(W)\, \rho_N(x^N,W) \right]$$

$$\le \int_{G_N} d\mu^N(x^N) \left[ D + \delta + \frac{1}{N}(\rho^* + \delta) \int_{W:\rho_N(x^N,W) > N(D+\delta)} dP_{\mathcal{C}}(W) \right]$$

where we have used the fact that for $x \in G_N$ the maximum distortion is given by
$\rho^* + \delta$. Define the probability

$$P(N^{-1}\rho_N(x^N,\mathcal{C}) > D + \delta \,|\, x^N) = \int_{W:\rho_N(x^N,W) > N(D+\delta)} dP_{\mathcal{C}}(W)$$

and summarize the above bounds by

$$\Delta_N \le D + \delta + (\rho^* + \delta) \frac{1}{N} \int d\mu^N(x^N)\, P(N^{-1}\rho_N(x^N,\mathcal{C}) > D + \delta \,|\, x^N)$$

$$+ \frac{1}{N} \int_{G_N^c} d\mu^N(x^N)\, \rho_N(x^N,a^{*N}). \tag{11.11}$$

The remainder of the proof is devoted to proving that the two integrals above
go to 0 as $N \to \infty$ and hence

\[ \limsup_{N \to \infty} \Delta_N \le D + \delta. \tag{11.12} \]

Consider first the integral

\[ a_N = \frac{1}{N}\int_{G_N^c} d\mu^N(x^N)\, \rho_N(x^N, a^{*N})
 = \int d\mu^N(x^N)\, 1_{G_N^c}(x^N)\, \frac{1}{N}\rho_N(x^N, a^{*N}). \]

We shall see that this integral goes to zero as an easy application of the ergodic
theorem. The integrand is dominated by $N^{-1}\rho_N(x^N, a^{*N})$, which is uniformly
integrable (Lemma 4.7.2 of [50]), and hence the integrand is itself uniformly
integrable (Lemma 4.4.4 of [50]). Thus we can invoke the extended Fatou lemma
to conclude that

\[ \begin{aligned}
\limsup_{N \to \infty} a_N &\le \int d\mu(x)\, \limsup_{N \to \infty} \Big( 1_{G_N^c}(x^N)\, \frac{1}{N}\rho_N(x^N, a^{*N}) \Big) \\
&\le \int d\mu(x)\, \Big( \limsup_{N \to \infty} 1_{G_N^c}(x^N) \Big) \Big( \limsup_{N \to \infty} \frac{1}{N}\rho_N(x^N, a^{*N}) \Big).
\end{aligned} \]

We have, however, that $\limsup_{N \to \infty} 1_{G_N^c}(x^N)$ is 0 unless $x^N \in G_N^c$ i.o. But this
set has measure 0 since with probability 1, an $x$ is produced so that

\[ \lim_{N \to \infty} \frac{1}{N} \sum_{i=0}^{N-1} \rho(x_i, a^*) = \rho^* \]

exists and hence with probability one one gets an $x$ which can yield

\[ N^{-1}\rho_N(x^N, a^{*N}) > \rho^* + \delta \]

at most for a finite number of $N$. Thus the above integral of the product of a
function that is 0 a.e. with a dominated function must itself be 0 and hence

\[ \limsup_{N \to \infty} a_N = 0. \tag{11.13} \]

We now consider the second integral in (11.11):

\[ b_N = (\rho^* + \delta) \int d\mu^N(x^N)\, P(N^{-1}\rho_N(x^N, C) > D + \delta \mid x^N). \]

Recall that $P(N^{-1}\rho_N(x^N, C) > D + \delta \mid x^N)$ is the probability that for a fixed input
block $x^N$, a randomly selected code will result in a minimum distortion codeword
with per-symbol distortion larger than $D + \delta$. This is bounded above by the probability
that none of the $M$ words (excluding the reference code word) selected independently at
random according to the distribution $\eta^N$ lie within $D + \delta$ of the fixed input word $x^N$.
Hence

\[ P\Big( \frac{1}{N}\rho_N(x^N, C) > D + \delta \,\Big|\, x^N \Big) \le \Big[ 1 - \eta^N\Big( \frac{1}{N}\rho_N(x^N, Y^N) \le D + \delta \Big) \Big]^M, \]

where

\[ \eta^N\Big( \frac{1}{N}\rho_N(x^N, Y^N) \le D + \delta \Big) = \int_{y^N:\, \frac{1}{N}\rho_N(x^N, y^N) \le D + \delta} d\eta^N(y^N). \]

Now mutual information comes into the picture. The above probability can be
bounded below by adding a condition:

\[ \eta^N\Big( \frac{1}{N}\rho_N(x^N, Y^N) \le D + \delta \Big)
 \ge \eta^N\Big( \frac{1}{N}\rho_N(x^N, Y^N) \le D + \delta \text{ and } \frac{1}{N} i_N(x^N, Y^N) \le R + \delta \Big), \]

where

\[ \frac{1}{N} i_N(x^N, y^N) = \frac{1}{N} \ln f_N(x^N, y^N), \]

where

\[ f_N(x^N, y^N) = \frac{dp^N(x^N, y^N)}{d(\mu^N \times \eta^N)(x^N, y^N)}, \]

the Radon-Nikodym derivative of $p^N$ with respect to the product measure $\mu^N \times \eta^N$.
Thus we require that both the distortion and the sample information be no more than
slightly above their limiting values. We then have in the region of
integration that

\[ \frac{1}{N} i_N(x^N; y^N) = \frac{1}{N} \ln f_N(x^N, y^N) \le R + \delta \]

and hence

\[ \begin{aligned}
\eta^N\Big( \frac{1}{N}\rho_N(x^N, Y^N) \le D + \delta \Big)
&\ge \int_{y^N:\, \frac{1}{N}\rho_N(x^N,y^N) \le D+\delta,\ f_N(x^N,y^N) \le e^{N(R+\delta)}} d\eta^N(y^N) \\
&\ge e^{-N(R+\delta)} \int_{y^N:\, \frac{1}{N}\rho_N(x^N,y^N) \le D+\delta,\ f_N(x^N,y^N) \le e^{N(R+\delta)}} d\eta^N(y^N)\, f_N(x^N, y^N),
\end{aligned} \]

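which yields the bound

\[ \begin{aligned}
P\Big( \frac{1}{N}\rho_N(x^N, C) > D + \delta \,\Big|\, x^N \Big)
&\le \Big[ 1 - \eta^N\Big( \frac{1}{N}\rho_N(x^N, Y^N) \le D + \delta \Big) \Big]^M \\
&\le \Big[ 1 - e^{-N(R+\delta)} \int_{y^N:\, \frac{1}{N}\rho_N(x^N,y^N) \le D+\delta,\ \frac{1}{N} i_N(x^N,y^N) \le R+\delta} d\eta^N(y^N)\, f_N(x^N, y^N) \Big]^M.
\end{aligned} \]

Applying the inequality

\[ (1 - \alpha\beta)^M \le 1 - \beta + e^{-M\alpha} \]

for $\alpha, \beta \in [0, 1]$ yields

\[ P\Big( \frac{1}{N}\rho_N(x^N, C) > D + \delta \,\Big|\, x^N \Big)
 \le 1 - \int_{y^N:\, \frac{1}{N}\rho_N(x^N,y^N) \le D+\delta,\ \frac{1}{N} i_N(x^N,y^N) \le R+\delta} d\eta^N(y^N)\, f_N(x^N, y^N) + e^{-M e^{-N(R+\delta)}}. \]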
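The elementary inequality $(1 - \alpha\beta)^M \le 1 - \beta + e^{-M\alpha}$ used above can be checked numerically. The following Python sketch evaluates both sides on a grid of $\alpha, \beta \in [0, 1]$ for a few values of $M$; the grid size, the choice of $M$ values, the floating-point tolerance, and the function names are assumptions made for the illustration, and the check is of course no substitute for the convexity argument that proves the inequality.

```python
import math


def left_side(alpha, beta, m):
    """(1 - alpha * beta)^M."""
    return (1.0 - alpha * beta) ** m


def right_side(alpha, beta, m):
    """1 - beta + exp(-M * alpha)."""
    return 1.0 - beta + math.exp(-m * alpha)


def inequality_holds(grid=21, ms=(1, 2, 5, 50, 1000)):
    """Check the inequality on a uniform grid of [0, 1] x [0, 1]."""
    points = [i / (grid - 1) for i in range(grid)]
    return all(
        left_side(a, b, m) <= right_side(a, b, m) + 1e-12
        for a in points
        for b in points
        for m in ms
    )


holds_on_grid = inequality_holds()
```

In the proof the inequality is applied with $\alpha = e^{-N(R+\delta)}$ and with $\beta$ equal to the integral of $f_N$ over the restricted region.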

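Averaging with respect to the distribution $\mu^N$ yields

\[ \begin{aligned}
\frac{b_N}{\rho^* + \delta} &= \int d\mu^N(x^N)\, P\Big( \frac{1}{N}\rho_N(x^N, C) > D + \delta \,\Big|\, x^N \Big) \\
&\le \int d\mu^N(x^N) \Big( 1 - \int_{y^N:\, \frac{1}{N}\rho_N(x^N,y^N) \le D+\delta,\ \frac{1}{N} i_N(x^N,y^N) \le R+\delta} d\eta^N(y^N)\, f_N(x^N, y^N) + e^{-M e^{-N(R+\delta)}} \Big) \\
&= 1 - \int_{(x^N,y^N):\, \frac{1}{N}\rho_N(x^N,y^N) \le D+\delta,\ \frac{1}{N} i_N(x^N,y^N) \le R+\delta} d(\mu^N \times \eta^N)(x^N, y^N)\, f_N(x^N, y^N) + e^{-M e^{-N(R+\delta)}} \\
&= 1 + e^{-M e^{-N(R+\delta)}} - \int_{(x^N,y^N):\, \frac{1}{N}\rho_N(x^N,y^N) \le D+\delta,\ \frac{1}{N} i_N(x^N,y^N) \le R+\delta} dp^N(x^N, y^N) \\
&= 1 + e^{-M e^{-N(R+\delta)}} - p^N\Big( (x^N, y^N) : \frac{1}{N}\rho_N(x^N, y^N) \le D + \delta,\ \frac{1}{N} i_N(x^N, y^N) \le R + \delta \Big).
\end{aligned} \tag{11.14} \]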

Since $M$ is bounded below by $e^{N(R+\epsilon)} - 1$, the exponential term is bounded
above by

\[ e^{-e^{N(R+\epsilon)} e^{-N(R+\delta)} + e^{-N(R+\delta)}} = e^{-e^{N(\epsilon - \delta)} + e^{-N(R+\delta)}}. \]

If $\epsilon > \delta$, this term goes to 0 as $N \to \infty$.
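To see how quickly the exponential term vanishes when $\epsilon > \delta$, the following Python sketch compares the term $e^{-M e^{-N(R+\delta)}}$ with $M = \lfloor e^{N(R+\epsilon)} \rfloor$ against the upper bound just derived. The numerical values of $R$, $\epsilon$, and $\delta$ and the function names are arbitrary assumptions chosen only for illustration.

```python
import math


def exponential_term(n, rate, eps, delta):
    """exp(-M * exp(-N (R + delta))) with M = floor(exp(N (R + eps)))."""
    m = math.floor(math.exp(n * (rate + eps)))
    return math.exp(-m * math.exp(-n * (rate + delta)))


def exponential_bound(n, rate, eps, delta):
    """Upper bound exp(-exp(N (eps - delta)) + exp(-N (R + delta)))."""
    return math.exp(-math.exp(n * (eps - delta)) + math.exp(-n * (rate + delta)))


# Hypothetical parameters with eps > delta, so both quantities decay with N.
rate, eps, delta = 1.0, 0.2, 0.1
table = [
    (n, exponential_term(n, rate, eps, delta), exponential_bound(n, rate, eps, delta))
    for n in (10, 20, 50, 100)
]
```

Each row of `table` holds the block length, the exact term, and its bound. The decay is doubly exponential in $N(\epsilon - \delta)$, which is why any fixed gap $\epsilon > \delta$ suffices.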

The probability term in (11.14) goes to 1 from the mean ergodic theorem
applied to $\rho_1$ and the mean ergodic theorem for information density since mean
convergence (or the almost everywhere convergence proved elsewhere) implies
convergence in probability. This implies that

\[ \limsup_{N \to \infty} b_N = 0, \]

which with (11.13) gives (11.12). Since $\delta$ is arbitrary, choosing $N$ sufficiently large
proves that there exists a block code $C$ with average distortion less than
$D(R, \mu) + \delta$ and rate less than $R + \epsilon$ and hence

\[ \delta(R + \epsilon, \mu) \le D(R, \mu) + \delta. \tag{11.15} \]

Since $\epsilon$ and $\delta$ can be chosen as small as desired and since $D(R, \mu)$ is a continuous
function of $R$ (Lemma 10.6.1), the theorem is proved. □

The source coding theorem is originally due to Shannon [129] [130], who 
proved it for discrete i.i.d. sources. It was extended to stationary and ergodic
discrete alphabet sources and Gaussian sources by Gallager [43] and to station- 
ary and ergodic sources with abstract alphabets by Berger [10] [11], but an 
error in the information density convergence result of Perez [123] (see Kieffer 
[74]) left a gap in the proof, which was subsequently repaired by Dunham [35]. 
The result was extended to nonergodic stationary sources and metric distortion 
measures and Polish alphabets by Gray and Davisson [53] and to AMS ergodic 
processes by Gray and Saadat [61]. The method used here of using a stationary 
and ergodic measure to construct the block codes and thereby avoid the block 
ergodic decomposition of Nedoma [106] used by Gallager [43] and Berger [11] 
was suggested by Pursley and Davisson [29] and developed in detail by Gray 
and Saadat [61]. 

11.5 Subadditive Fidelity Criteria 

In this section we generalize the block source coding theorem for stationary 
sources to subadditive fidelity criteria. Several of the interim results derived 
previously are no longer appropriate, but we describe those that are still valid 
in the course of the proof of the main result. Most importantly, we now con- 
sider only stationary and not AMS sources. The result can be extended to 
AMS sources in the two-sided case, but it is not known for the one-sided case. 
Source coding theorems for subadditive fidelity criteria were first developed by 
Mackenthun and Pursley [96]. 

Theorem 11.5.1: Let $\mu$ denote a stationary and ergodic distribution of a
source $\{X_n\}$ and let $\{\rho_n\}$ be a subadditive fidelity criterion with a reference
letter, i.e., there is an $a^* \in \hat{A}$ such that

\[ E\rho_1(X_0, a^*) = \rho^* < \infty. \]

Then the OPTA for the class of block codes of rate less than $R$ is given by the
Shannon distortion-rate function $D(R, \mu)$.

Proof: Suppose that we have a block code of length $N$, e.g., a block encoder
$\alpha : A^N \to B^K$ and a block decoder $\beta : B^K \to \hat{A}^N$. Since the source is stationary,
the induced input/output distribution is then $N$-stationary and the performance
resulting from using this code on a source $\mu$ is

\[ \Delta_N = E_p \rho_\infty = \frac{1}{N} E_p \rho_N(X^N, \hat{X}^N), \]

where $\{\hat{X}_n\}$ is the resulting reproduction process. Let $\delta_N(R, \mu)$ denote the
infimum over all codes of length $N$ of the performance using such codes and let
$\delta(R, \mu)$ denote the infimum of $\delta_N$ over all $N$, that is, the OPTA. We do not
assume a codebook/minimum distortion structure because the distortion is now
effectively context dependent and it is not obvious that the best codes will have
this form. Assume that given an $\epsilon > 0$ we have chosen for each $N$ a length $N$
code such that

\[ \delta_N(R, \mu) \ge \Delta_N - \epsilon. \]

As previously we assume that

\[ \frac{K \log \|B\|}{N} \le R, \]

where the constraint $R$ is the rate of the code. As in the proof of the converse
coding theorem for an additive distortion measure, we have that for the resulting
process $I(X^N; \hat{X}^N) \le RN$ and hence

\[ \Delta_N \ge D_N(R, \mu). \]

From Lemma 10.6.2 we can take the infimum over all $N$ to find that

\[ \delta(R, \mu) = \inf_N \delta_N(R, \mu) \ge \inf_N D_N(R, \mu) - \epsilon = D(R, \mu) - \epsilon. \]

Since $\epsilon$ is arbitrary, $\delta(R, \mu) \ge D(R, \mu)$, proving the converse theorem.

To prove the positive coding theorem we proceed in an analogous manner
to the proof for the additive case, except that we use Lemma 10.6.3 instead of
Theorem 10.6.1. First pick an $N$ large enough so that

\[ D_N(R, \mu) \le D(R, \mu) + \frac{\delta}{2} \]

and then select a $p^N \in \mathcal{R}_N(R, \mu^N)$ such that

\[ E_{p^N} \frac{1}{N}\rho_N(X^N, Y^N) \le D_N(R, \mu) + \frac{\delta}{2} \le D(R, \mu) + \delta. \]

Now construct as in Lemma 10.6.3 a stationary and ergodic process $p$
which will have (10.6.4) and (10.6.5) satisfied (the right $N$th order distortion
and information). This step taken, the proof proceeds exactly as in the additive
case since the reference vector yields the bound

\[ \frac{1}{N}\rho_N(x^N, a^{*N}) \le \frac{1}{N} \sum_{i=0}^{N-1} \rho_1(x_i, a^*), \]

which converges, and since $N^{-1}\rho_N(x^N, y^N)$ converges as $N \to \infty$ with $p$ probability
one from the subadditive ergodic theorem. Thus the existence of a code
satisfying (11.15) can be demonstrated (which uses the minimum distortion encoder)
and this implies the result since $D(R, \mu)$ is a continuous function of $R$
(Lemma 10.6.1). □

11.6 Asynchronous Block Codes

The block codes considered so far all assume block synchronous communication, 
that is, that the decoder knows where the blocks begin and hence can deduce 
the correct words in the codebook from the index represented by the channel 
block. In this section we show that we can construct asynchronous block codes 
with little loss in performance or rate; that is, we can construct a block code 
so that a decoder can uniquely determine how the channel data are parsed and 
hence deduce the correct decoding sequence. This result will play an important 
role in the development in the next section of sliding block coding theorems. 

Given a source $\mu$ let $\delta_{\text{async}}(R, \mu)$ denote the OPTA function for block codes
with the added constraint that the decoder be able to synchronize, that is,
correctly parse the channel codewords. Obviously

\[ \delta_{\text{async}}(R, \mu) \ge \delta(R, \mu) \]

since we have added a constraint. The goal of this section is to prove the
following result:

Theorem 11.6.1: Given an AMS source with an additive fidelity criterion
and a reference letter,

\[ \delta_{\text{async}}(R, \mu) = \delta(R, \mu), \]

that is, the OPTA for asynchronous codes is the same as that for ordinary codes.

Proof: A simple way of constructing a synchronized block code is to use a 
prefix code: Every codeword begins with a short prefix or source synchronization 
word or, simply, sync word, that is not allowed to appear anywhere else within 
a word or as any part of an overlap of the prefix and a piece of the word. The 
decoder then need only locate the prefix in order to decode the block begun by
the prefix. The insertion of the sync word causes a reduction in the available 
number of codewords and hence a loss in rate, but ideally this loss can be made 
negligible if properly done. We construct a code in this fashion by finding a good 
codebook of slightly smaller rate and then indexing it by channel $K$-tuples with
this prefix property. 

Suppose that our channel has a rate constraint $R$, that is, if source $N$-tuples
are mapped into channel $K$-tuples then

\[ \frac{K \log \|B\|}{N} \le R, \]

where $B$ is the channel alphabet. We assume that the constraint is achievable
on the channel in the sense that we can choose $N$ and $K$ so that the physical
stationarity requirement is met ($N$ source time units correspond to $K$ channel
time units) and such that

\[ \|B\|^K \approx e^{NR}, \tag{11.16} \]

at least for large $N$.

If $K$ is to be the block length of the channel code words, let $\delta$ be small and
define $k(K) = \lfloor \delta K \rfloor + 1$ and consider channel codewords which have a prefix
of $k(K)$ occurrences of a single channel letter, say $b$, followed by a sequence of
$K - k(K)$ channel letters which have the following constraint: no $k(K)$-tuple
beginning after the first symbol can be $b^{k(K)}$. We permit $b$'s to occur at the end
of a $K$-tuple so that a $k(K)$-tuple of $b$'s may occur in the overlap of the end of
a codeword and the new prefix since this causes no confusion, e.g., if we see an
elongated sequence of $b$'s, the actual code information starts at the right edge.
Let $M(K)$ denote the number of distinct channel $K$-tuples of this form. Since
$M(K)$ is the number of distinct reproduction codewords that can be indexed
by channel codewords, the codebooks will be constrained to have rate

\[ R_K = \frac{\ln M(K)}{N}. \]


We now study the behavior of $R_K$ as $K$ gets large. There are a total of
$\|B\|^{K-k(K)}$ $K$-tuples having the given prefix. Of these, no more than
$(K - k(K))\|B\|^{K-2k(K)}$ have the sync sequence appearing somewhere within the
word (there are fewer than $K - k(K)$ possible locations for the sync word
and for each location the remaining $K - 2k(K)$ symbols can be anything).
Lastly, we must also eliminate those words for which the first $i$ symbols after the prefix are $b$
for $i = 1, 2, \ldots, k(K) - 1$ since this will cause confusion about the right edge of
the sync sequence. These terms contribute

\[ \sum_{i=1}^{k(K)-1} \|B\|^{K-k(K)-i} \]

bad words. Using the geometric progression formula to sum the above series we
have that it is bounded above by

\[ \frac{\|B\|^{K-k(K)-1}}{1 - 1/\|B\|}. \]

Thus the total number of available channel vectors is at least

\[ M(K) \ge \|B\|^{K-k(K)} - (K - k(K))\|B\|^{K-2k(K)} - \frac{\|B\|^{K-k(K)-1}}{1 - 1/\|B\|}. \]

Thus

\[ \begin{aligned}
R_K &\ge \frac{1}{N} \ln \|B\|^{K-k(K)} + \frac{1}{N} \ln\Big( 1 - (K - k(K))\|B\|^{-k(K)} - \frac{1}{\|B\| - 1} \Big) \\
&= \frac{K - k(K)}{N} \ln \|B\| + \frac{1}{N} \ln\Big( \frac{\|B\| - 2}{\|B\| - 1} - (K - k(K))\|B\|^{-k(K)} \Big) \\
&\ge (1 - \delta)R + o(N),
\end{aligned} \]

where $o(N)$ is a term that goes to 0 as $N$ (and hence $K$) goes to infinity. Thus
given a channel with rate constraint $R$ and given $\epsilon > 0$, we can construct for $N$
sufficiently large a collection of approximately $e^{N(R-\epsilon)}$ channel $K$-tuples (where
$K \approx NR$) which are synchronizable, that is, satisfy the prefix condition.

We are now ready to construct the desired code. Fix $\delta > 0$ and then choose
$\epsilon > 0$ small enough to ensure that

\[ \delta(R(1 - \epsilon), \mu) \le \delta(R, \mu) + \frac{\delta}{3} \]

(which we can do since $\delta(R, \mu)$ is continuous in $R$). Then choose an $N$ large
enough to give a prefix channel code as above and to yield a rate $R - \epsilon$ codebook
$C$ so that

\[ \rho_N(C, \mu) \le \delta_N(R - \epsilon, \mu) + \frac{\delta}{3}
 \le \delta(R - \epsilon, \mu) + \frac{2\delta}{3} \le \delta(R, \mu) + \delta. \tag{11.17} \]

The resulting code proves the theorem. □
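The counting argument in the proof above can be checked by brute force for small parameters. The following Python sketch enumerates all channel $K$-tuples that begin with $k$ copies of the sync letter and in which the sync word $b^k$ does not begin at any position after the first symbol, and compares the count with the lower bound on $M(K)$ derived in the proof. Representing the sync letter $b$ by the integer 0, the reading of the prefix constraint encoded in the loop, and the function names are assumptions made for this illustration.

```python
from itertools import product


def count_sync_words(alphabet_size, word_len, prefix_len):
    """Count K-tuples with prefix b^k and no later start of the sync word b^k.

    Letter 0 plays the role of the sync letter b. Trailing b's are allowed,
    since a sync word straddling two codewords causes no confusion.
    """
    sync = (0,) * prefix_len
    count = 0
    for tail in product(range(alphabet_size), repeat=word_len - prefix_len):
        word = sync + tail
        starts = range(1, word_len - prefix_len + 1)
        if all(word[i:i + prefix_len] != sync for i in starts):
            count += 1
    return count


def lower_bound(alphabet_size, word_len, prefix_len):
    """Union-bound estimate of M(K) from the proof (may be negative or loose)."""
    b, K, k = alphabet_size, word_len, prefix_len
    return b ** (K - k) - (K - k) * b ** (K - 2 * k) - b ** (K - k - 1) / (1 - 1 / b)


small_cases = [(2, 6, 2), (3, 8, 3), (4, 7, 2), (3, 9, 2)]
results = [(case, count_sync_words(*case), lower_bound(*case)) for case in small_cases]
```

For small alphabets and short words the bound is far from tight (it can even be negative), which is consistent with the proof, where only the exponential growth rate of $M(K)$ for large $K$ matters.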

11.7 Sliding Block Source Codes 

We now turn to sliding block codes. For simplicity we consider codes which
map blocks into single symbols. For example, a sliding block encoder will be a
mapping $f : A^N \to B$ and the decoder will be a mapping $g : B^K \to \hat{A}$. In the
case of one-sided processes, for example, the channel sequence would be given
by

\[ U_n = f(X_n^N) \]

and the reproduction sequence by

\[ \hat{X}_n = g(U_n^K). \]

When the processes are two-sided, it is more common to use memory as well
as delay. This is often done by having an encoder mapping $f : A^{2N+1} \to B$,
a decoder $g : B^{2L+1} \to \hat{A}$, and the channel and reproduction sequences being
defined by

\[ U_n = f(X_{n-N}, \ldots, X_n, \ldots, X_{n+N}), \]

\[ \hat{X}_n = g(U_{n-L}, \ldots, U_n, \ldots, U_{n+L}). \]

We shall emphasize the two-sided case.

The final output can be viewed as a sliding block coding of the input:

\[ \begin{aligned}
\hat{X}_n &= g\big( f(X_{n-L-N}, \ldots, X_{n-L+N}), \ldots, f(X_{n+L-N}, \ldots, X_{n+L+N}) \big) \\
&= gf(X_{n-(N+L)}, \ldots, X_{n+(N+L)}),
\end{aligned} \]

where we use $gf$ to denote the overall coding, that is, the cascade of $g$ and $f$.
Note that the delay and memory of the overall code are the sums of those for
the encoder and decoder. The overall window length is $2(N + L) + 1$.
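The cascade structure can be made concrete with a small Python sketch. It applies a two-sided sliding block encoder and decoder to a finite binary sequence, producing outputs only where a full window is available, so that a sequence of length $n$ yields $n - 2(N+L)$ reproduction symbols. The majority-vote maps used for $f$ and $g$, the binary alphabets, and the function names are assumptions chosen purely for illustration; they are not codes constructed in the text.

```python
def slide(seq, window_map, half_width):
    """Apply window_map to every full window of length 2 * half_width + 1."""
    width = 2 * half_width + 1
    return [
        window_map(tuple(seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    ]


def majority(window):
    """Hypothetical window map: 1 if more than half of the symbols are 1."""
    return int(2 * sum(window) > len(window))


def cascade(x, f, enc_half_width, g, dec_half_width):
    """Overall code gf: encode with f, then decode with g."""
    channel = slide(x, f, enc_half_width)
    return slide(channel, g, dec_half_width)


x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
N, L = 1, 2
x_hat = cascade(x, majority, N, majority, L)
shifted = cascade(x[1:], majority, N, majority, L)
```

Shifting the input by one symbol shifts the output by one symbol (`shifted` equals `x_hat[1:]`), which is the stationarity property that distinguishes sliding block codes from block codes.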

Since one channel symbol is sent for every source symbol, the rate of such a
code is given simply by $R = \log \|B\|$ bits per source symbol. The obvious problem
with this restriction is that we are limited to rates which are logarithms of
integers, e.g., we cannot get fractional rates. As previously discussed, however,
we could get fractional rates by appropriate redefinition of the alphabets (or,
equivalently, of the shifts on the corresponding sequence spaces). For example,
regardless of the code window lengths involved, if we shift $l$ source symbols to
produce a new group of $k$ channel symbols (to yield an $(l, k)$-stationary encoder)
and then shift a group of $k$ channel symbols to produce a new group of $l$ source
symbols, then the rate is

\[ R = \frac{k}{l} \log \|B\| \]

bits or nats per source symbol and the overall code $fg$ is $l$-stationary. The
added notation to make this explicit is significant and the generalization is
straightforward; hence we will stick to the simpler case.

We can define the sliding block OPTA for a source and channel in the natural
way. Suppose that we have an encoder $f$ and a decoder $g$. Define the resulting
performance by

\[ \rho(fg, \mu) = E_{\mu fg}\, \rho_\infty, \]

where $\mu fg$ is the input/output hookup of the source $\mu$ connected to the deterministic
channel $fg$ and where $\rho_\infty$ is the sequence distortion. Define

\[ \delta_{\text{SBC}}(R, \mu) = \inf_{f, g} \rho(fg, \mu) = \Delta^*(\mu, \mathcal{E}, \nu, \mathcal{D}), \]

where $\mathcal{E}$ is the class of all finite length sliding block encoders and $\mathcal{D}$ is the
collection of all finite length sliding block decoders. The rate constraint $R$ is
determined by the channel.

Assume as usual that $\mu$ is AMS with stationary mean $\bar{\mu}$. Since the cascade of
stationary channels $fg$ is itself stationary (Lemma 9.4.7), we have from Lemma
9.3.2 that $\mu fg$ is AMS with stationary mean $\bar{\mu} fg$. This implies from (10.10)
that for any sliding block codes $f$ and $g$

\[ E_{\mu fg}\, \rho_\infty = E_{\bar{\mu} fg}\, \rho_\infty \]

and hence

\[ \delta_{\text{SBC}}(R, \mu) = \delta_{\text{SBC}}(R, \bar{\mu}), \]

a fact we now formalize as a lemma.

Lemma 11.7.1: Suppose that $\mu$ is an AMS source with stationary mean $\bar{\mu}$
and let $\{\rho_n\}$ be an additive fidelity criterion. Let $\delta_{\text{SBC}}(R, \mu)$ denote the sliding
block coding OPTA function for the source and a channel with rate constraint
$R$. Then

\[ \delta_{\text{SBC}}(R, \mu) = \delta_{\text{SBC}}(R, \bar{\mu}). \]

The lemma permits us to concentrate on stationary sources when quantifying
the optimal performance of sliding block codes.

The principal result of this section is the following: 

Theorem 11.7.1: Given an AMS and ergodic source $\mu$ and an additive
fidelity criterion with a reference letter,

\[ \delta_{\text{SBC}}(R, \mu) = \delta(R, \mu), \]

that is, the class of sliding block codes is capable of exactly the same performance
as the class of block codes. If the source is only AMS and not ergodic, then at
least

\[ \delta_{\text{SBC}}(R, \mu) \ge \delta(R, \mu). \tag{11.18} \]

Proof: The proof of (11.18) follows that of Shields and Neuhoff [133] for the 
finite alphabet case, except that their proof was for ergodic sources and coded 
only typical input sequences. Their goal was different because they measured the 
rate of a sliding block code by the entropy rate of its output, effectively assuming 
that further almost-noiseless coding was to be used. Because we consider a fixed 
channel and measure the rate in the usual way as a coding rate, this problem 
does not arise here. From the previous lemma we need only prove the result for 
stationary sources and hence we henceforth assume that $\mu$ is stationary. We first
prove that sliding block codes can perform no better than block codes, that is,
(11.18) holds. Fix $\delta > 0$ and suppose that $f : A^{2N+1} \to B$ and $g : B^{2L+1} \to \hat{A}$
are finite-length sliding block codes for which

\[ \rho(fg, \mu) \le \delta_{\text{SBC}}(R, \mu) + \delta. \]

This yields a cascade sliding block code $fg : A^{2(N+L)+1} \to \hat{A}$ which we use to
construct a block codebook. Choose $K$ large (to be specified later). Observe
an input sequence $x^n$ of length $n = 2(N + L) + 1 + K$ and map it into a
reproduction sequence $\hat{x}^n$ as follows: Set the first and last $(N + L)$ symbols
to the reference letter $a^*$, that is, $\hat{x}_0^{N+L} = \hat{x}_{n-N-L}^{N+L} = a^{*(N+L)}$. Complete the
remaining reproduction symbols by sliding block coding the source word using
the given codes, that is,

\[ \hat{x}_i = fg\big( x_{i-(N+L)}^{2(N+L)+1} \big); \quad i = N + L + 1, \ldots, K + N + L. \]

Thus the long block code is obtained by sliding block coding, except at the 
edges where the sliding block code is not permitted to look at previous or future 
source symbols and hence are filled with a reference symbol. Call the resulting 
codebook C. The rate of the block code is less than R = log ||B|| because n 
channel symbols are used to produce a reproduction word of length n and hence 
the codebook can have no more than $\|B\|^n$ possible vectors. Thus the rate
is $\log \|B\|$ since the codebook is used to encode a source $n$-tuple. Using this
codebook with a minimum distortion rule can do no worse (except at the edges)
than if the original sliding block code had been used and therefore if $\hat{X}_i$ is the
reproduction process produced by the block code and $Y_i$ that produced by the
sliding block code, we have (invoking stationarity) that

\[ \begin{aligned}
n\rho(C, \mu) &\le E\Big( \sum_{i=0}^{N+L-1} \rho(X_i, a^*) \Big) + E\Big( \sum_{i=N+L}^{K+N+L} \rho(X_i, Y_i) \Big) + E\Big( \sum_{i=K+N+L+1}^{K+2(L+N)} \rho(X_i, a^*) \Big) \\
&\le 2(N + L)\rho^* + K(\delta_{\text{SBC}}(R, \mu) + \delta)
\end{aligned} \]

and hence

\[ \rho(C, \mu) \le \frac{2(N+L)}{2(N+L)+K+1}\, \rho^* + \frac{K}{2(N+L)+K+1}\, (\delta_{\text{SBC}}(R, \mu) + \delta). \]

By choosing $\delta$ small enough and $K$ large enough we can make the right
hand side arbitrarily close to $\delta_{\text{SBC}}(R, \mu)$. Since $\delta(R, \mu) \le \rho(C, \mu)$, this proves (11.18).
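The edge effect in the preceding bound can be illustrated numerically. The following Python sketch evaluates the right-hand side of the bound on $\rho(C, \mu)$ for increasing $K$, showing that the penalty paid for filling the $2(N+L)$ edge positions with the reference letter becomes negligible. The numerical values of $N$, $L$, $\rho^*$, and the sliding block distortion are hypothetical, and the function name is an assumption of this example.

```python
def block_from_sliding_bound(N, L, K, rho_star, sbc_distortion):
    """Per-symbol distortion bound for the block code built from fg.

    The 2 (N + L) edge symbols cost at most rho_star each on average, and the
    K interior symbols cost at most sbc_distortion each on average, where
    sbc_distortion stands for delta_SBC(R, mu) + delta.
    """
    edge = 2 * (N + L)
    n = edge + 1 + K
    return (edge * rho_star + K * sbc_distortion) / n


# Hypothetical values: window half-widths 2 and 3, rho* = 5.0, target 0.3.
bounds = {
    K: block_from_sliding_bound(2, 3, K, 5.0, 0.3)
    for K in (10, 1000, 10**6)
}
```

With the encoder and decoder windows held fixed, the bound approaches the sliding block distortion as $K$ grows, which is the content of the limiting argument above.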

We now proceed to prove the converse inequality, 

\[ \delta(R, \mu) \ge \delta_{\text{SBC}}(R, \mu), \tag{11.19} \]

which involves a bit more work. 

Before carefully tackling the proof, we note the general idea and an “almost 
proof” that unfortunately does not quite work, but which may provide some 
insight. Suppose that we take a very good block code, e.g., a block code C of 
block length N such that 

\[ \rho(C, \mu) \le \delta(R, \mu) + \delta \]

for a fixed $\delta > 0$. We now wish to form a sliding block code for the same channel
with approximately the same performance. Since a sliding block code is just a 
stationary code (at least if we permit an infinite window length) , the goal can be 
viewed as “stationarizing” the nonstationary block code. One approach would 
be the analogy of the SBM channel: Since a block code can be viewed as a de- 
terministic block memoryless channel, we could make it stationary by inserting 
occasional random spacing between long sequences of blocks. Ideally this would 
then imply the existence of a sliding block code from the properties of SBM 
channels. The problem is that the SBM channel so constructed would no longer 
be a deterministic coding of the input since it would require the additional input 
of a random punctuation sequence. Nor could one use a random coding argu- 
ment to claim that there must be a specific (nonrandom) punctuation sequence 
which could be used to construct a code since the deterministic encoder thus 
constructed would not be a stationary function of the input sequence, that is, it 
is only stationary if both the source and punctuation sequences are shifted to- 
gether. Thus we are forced to obtain the punctuation sequence from the source 
input itself in order to get a stationary mapping. The original proofs that this 
could be done used a strong form of the Rohlin-Kakutani theorem of Section 9.5 
given by Shields [131] [56] [58]. The Rohlin-Kakutani theorem demonstrates
the existence of a punctuation sequence with the property that the punctuation 
sequence is very nearly independent of the source. Lemma 9.5.2 is a slightly 
weaker result than the strong form considered by Shields. 

The code construction described above can therefore be approximated by 
using a coding of the source instead of an independent process. Shields and 
Neuhoff [133] provided a simpler proof of a result equivalent to the Rohlin- 
Kakutani theorem and provided such a construction for finite alphabet sources. 
Davisson and Gray [27] provided an alternative heuristic development of a similar
construction. We here adopt a somewhat different tack in order to avoid
some of the problems arising in extending these approaches to general alphabet
sources and to nonergodic sources. The principal difference is that we do
not try to prove or use any approximate independence between source and the 
punctuation process derived from the source (which is code dependent in the 
case of continuous alphabets). Instead we take a good block code and first pro- 
duce a much longer block code that is insensitive to shifts or starting positions 
using the same construction used to relate block coding performance of AMS 
processes and that of their stationary mean. This modified block code is then 
made into a sliding block code using a punctuation sequence derived from the 
source. Because the resulting block code is little affected by starting time, the 
only important property is that most of the time the block code is actually 
in use. Independence of the punctuation sequence and the source is no longer 
required. The approach is most similar to that of Davisson and Gray [27], but 
the actual construction differs in the details. An alternative construction may 
be found in Kieffer [79]. 

Given $\delta > 0$ and $\epsilon > 0$, choose for large enough $N$ an asynchronous block
code $C$ of block length $N$ such that

\[ \frac{1}{N} \log \|C\| \le R - 2\epsilon \]

and

\[ \rho(C, \mu) \le \delta(R, \mu) + \delta. \tag{11.20} \]

The continuity of the block OPTA function and the theorem for asynchronous
block source coding ensure that we can do this. Next we construct a longer
block code that is more robust against shifts. For $i = 0, 1, \ldots, N - 1$ construct
the codes $C_K(i)$ having length $K = JN$ as in the proof of Lemma 11.2.4. These
codebooks look like $J - 1$ repetitions of the codebook $C$ starting from time $i$
with the leftover symbols at the beginning and end being filled by the reference
letter. We then form the union code $C_K = \bigcup_i C_K(i)$ as in the proof of Corollary
11.2.4 which has all the shifted versions. This code has rate no greater than
$R - 2\epsilon + (JN)^{-1} \log N$. We assume that $J$ is large enough to ensure that

\[ \frac{1}{JN} \log N \le \epsilon \tag{11.21} \]

so that the rate is no greater than $R - \epsilon$ and that

\[ \frac{3}{J}\rho^* \le \delta. \tag{11.22} \]

We now construct a sliding block encoder $f$ and decoder $g$ from the given block
code. From Corollary 9.4.2 we can construct a finite length sliding block code
of $\{X_n\}$ to produce a two-sided $(NJ, \gamma)$-random punctuation sequence $\{Z_n\}$.
From the lemma $P(Z_0 = 2) \le \gamma$ and hence by the continuity of integration
(Corollary 4.4.2 of [50]) we can choose $\gamma$ small enough to ensure that

\[ E\big( 1_2(Z_0)\, \rho(X_0, a^*) \big) \le \delta. \tag{11.23} \]

Recall that the punctuation sequence usually produces 0's followed by $NJ - 1$
1's with occasional 2's interspersed to make things stationary. The sliding block
encoder $f$ begins with time 0 and scans backward $NJ$ time units to find the first
0 in the punctuation sequence. If there is no such 0, then put out an arbitrary
channel symbol $b$. If there is such a 0, say at time $-n$, then the block codebook $C_K$ is applied
to the input $K$-tuple $x_{-n}^K$ to produce the minimum distortion codeword

\[ u^K = \min_{y \in C_K}{}^{-1}\, \rho_K(x_{-n}^K, y) \]

and the appropriate channel symbol, $u_n$, produced by the channel. The sliding
block encoder thus has length at most $2NJ + 1$.

The decoder sliding block code g scans left N symbols to see if it finds a 
codebook sync sequence (remember the codebook is asynchronous and begins 
with a unique prefix or sync sequence). If it does not find one, it produces a 
reference letter. (In this case it is not in the middle of a code word.) If it 
does find one starting in position —n, then it produces the corresponding length 
N codeword from C and then puts out the reproduction symbol in position n. 
Note that the decoder sliding block code has a finite window length of at most 
2N+1. 

We now evaluate the average distortion resulting from use of this sliding
block code. As a first step we mimic the proof of Lemma 10.6.3 up to the
assumption of mutual independence of the source and the punctuation process
(which is not the case here) to get that for a long source sequence of length $n$ if
the punctuation sequence is $z$, then

\[ \rho_n(x^n, \hat{x}^n) = \sum_{i \in J_0^n(z)} \rho(x_i, a^*) + \sum_{i \in J_1^n(z)} \rho_{NJ}(x_i^{NJ}, \hat{x}_i^{NJ}), \]

where $J_0^n(z)$ is the collection of all $i$ for which $z_i$ is not in an $NJ$-cell (and hence
filler is being sent) and $J_1^n(z)$ is the collection of all $i$ for which $z_i$ is 0 and hence
begins an $NJ$-cell and hence an $NJ$ length codeword. Each one of these length
$NJ$ codewords contains at most $N$ reference letters at the beginning and $N$
reference letters at the end, and in the middle it contains all shifts of
sequences of length $N$ codewords from $C$. Thus for any $i \in J_1^n(z)$, we can write
that

\[ \rho_{NJ}(x_i^{NJ}, \hat{x}_i^{NJ}) \le \rho_N(x_i^N, a^{*N}) + \rho_N(x_{i+NJ-N}^N, a^{*N}) + \sum_{j=\lfloor i/N \rfloor}^{\lfloor i/N \rfloor + J - 1} \rho_N(x_{jN}^N, C). \]

This yields the bound

\[ \begin{aligned}
\frac{1}{n}\rho_n(x^n, \hat{x}^n) &\le \frac{1}{n} \sum_{i \in J_0^n(z)} \rho(x_i, a^*)
 + \frac{1}{n} \sum_{i \in J_1^n(z)} \Big( \rho_N(x_i^N, a^{*N}) + \rho_N(x_{i+NJ-N}^N, a^{*N}) \Big)
 + \frac{1}{n} \sum_{j=0}^{\lfloor n/N \rfloor} \rho_N(x_{jN}^N, C) \\
&= \frac{1}{n} \sum_{i=0}^{n-1} 1_2(z_i)\, \rho(x_i, a^*)
 + \frac{1}{n} \sum_{i=0}^{n-1} 1_0(z_i) \Big( \rho_N(x_i^N, a^{*N}) + \rho_N(x_{i+NJ-N}^N, a^{*N}) \Big)
 + \frac{1}{n} \sum_{j=0}^{\lfloor n/N \rfloor} \rho_N(x_{jN}^N, C),
\end{aligned} \]

where $1_a(z)$ is 1 if $z = a$ and 0 otherwise. Taking expectations above we have
that

\[ \begin{aligned}
E\Big( \frac{1}{n}\rho_n(X^n, \hat{X}^n) \Big) &\le \frac{1}{n} \sum_{i=0}^{n-1} E\big( 1_2(Z_i)\, \rho(X_i, a^*) \big) \\
&\quad + \frac{1}{n} \sum_{i=0}^{n-1} E\Big( 1_0(Z_i) \big( \rho_N(X_i^N, a^{*N}) + \rho_N(X_{i+NJ-N}^N, a^{*N}) \big) \Big)
 + \frac{1}{n} \sum_{j=0}^{\lfloor n/N \rfloor} E\rho_N(X_{jN}^N, C).
\end{aligned} \]

Invoke stationarity to write

\[ E\Big( \frac{1}{n}\rho_n(X^n, \hat{X}^n) \Big) \le E\big( 1_2(Z_0)\, \rho(X_0, a^*) \big)
 + \frac{1}{NJ} E\big( 1_0(Z_0)\, \rho_{2N+1}(X^{2N+1}, a^{*(2N+1)}) \big) + \frac{1}{N} E\rho_N(X^N, C). \]

The first term is bounded above by $\delta$ from (11.23). The middle term can be
bounded above using (11.22) by

\[ \begin{aligned}
\frac{1}{JN} E\big( 1_0(Z_0)\, \rho_{2N+1}(X^{2N+1}, a^{*(2N+1)}) \big) &\le \frac{1}{JN} E\rho_{2N+1}(X^{2N+1}, a^{*(2N+1)}) \\
&= \frac{1}{JN}(2N + 1)\rho^* \le \Big( \frac{2}{J} + \frac{1}{JN} \Big)\rho^* \le \delta.
\end{aligned} \]

Thus we have from the above and (11.20) that

\[ E\rho(X_0, \hat{X}_0) \le \rho(C, \mu) + 3\delta. \]

This proves the existence of a finite window sliding block encoder and a finite 
window length decoder with performance arbitrarily close to that achievable by 
block codes. □ 

The only use of ergodicity in the proof of the theorem was in the selection 
of the source sync sequence used to imbed the block code in a sliding block 
code. The result would extend immediately to nonergodic stationary sources 
(and hence to nonergodic AMS sources) if we could somehow find a single source 
sync sequence that would work for all ergodic components in the ergodic de- 
composition of the source. Note that the source synch sequence affects only the 
encoder and is irrelevant to the decoder which looks for asynchronous codewords 
prefixed by channel synch sequences (which consisted of a single channel letter 
repeated several times). Unfortunately, one cannot guarantee the existence of a 
single source sequence with small but nonzero probability under all of the ergodic
components. Since the components are ergodic, however, an infinite length sliding
ing block encoder could select such a source sequence in a simple (if impractical) 
way: Proceed as in the proof of the theorem up to the use of Corollary 9.4.2. 
Instead of using this result, we construct by brute force a punctuation sequence 
for the ergodic component in effect. Suppose that Q = {G*; i = 1,2,---} is a 
countable generating field for the input sequence space. Given 5 , the infinite 
length sliding block encoder first finds the smallest value of i for which 

1 n— 1 

0 < lim — 1 Gi(T k x), 

n —> oo TL ' 

k = 0 



and 

^ n — 1 

lim - V' 1 Gi (T k x)p(x k: a*) < 5, 

n—> oo 71 z ' 
k—0 

that is, we find a set with strictly positive relative frequency (and hence strictly 
positive probability with respect to the ergodic component in effect) which oc- 
curs rarely enough to ensure that the sample average distortion between the 
symbols produced when Gj occurs and the reference letter is smaller than S. 
Given N and <5 there must exist an i for which these relations hold (apply the 
proof of Lemma 9.4.4 to the ergodic component in effect with 7 chosen to sat- 
isfy (11.23) for that component and then replace the arbitrary set G by a set 
in the generating field having very close probability). Analogous to the proof of 
Lemma 9.4.4 we construct a punctuation sequence { Z n } using the event Gi in 
place of G. The proof then follows in a like manner except that now from the 
dominated convergence theorem we have that 

$$E\big(1_2(Z_0)\rho(X_0,a^*)\big)
  = \lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} E\big(1_2(Z_i)\rho(X_i,a^*)\big)
  = E\Big(\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} 1_2(Z_i)\rho(X_i,a^*)\Big) \le \delta$$

by construction.

The above argument is patterned after that of Davisson and Gray [27] and 
extends the theorem to stationary nonergodic sources if infinite window sliding 
block encoders are allowed. We can then approximate this encoder by a finite- 
window encoder, but we must make additional assumptions to ensure that the 
resulting encoder yields a good approximation in the sense of overall distortion. 
Suppose that $f$ is the infinite window length encoder and $g$ is the finite
window-length (say $2L+1$) decoder. Let $\mathcal{G}$ denote a countable
generating field of rectangles for the input sequence space. Then from
Corollary 4.2.2 applied to $\mathcal{G}$, given $\epsilon > 0$ we can find for
sufficiently large $N$ a finite window sliding block code
$r : A^{2N+1} \to B$ such that $\Pr(r \ne f) \le \epsilon/(2L+1)$, that is, the
two encoders produce the same channel symbol with high probability. The issue
is when this implies that $\rho(fg,\mu)$ and $\rho(\bar r g,\mu)$ are also
close, which would complete the proof. Let $\bar r : A^{\mathcal{T}} \to B$
denote the infinite-window sliding block encoder induced by $r$, i.e.,
$\bar r(x) = r(x_{-N}^{2N+1})$. Then

$$\rho(fg,\mu) = E(\rho(X_0,\hat X_0))
  = \sum_{b\in B^{2L+1}} \int_{x\in V_f(b)} d\mu(x)\,\rho(x_0,g(b)),$$

where

$$V_f(b) = \{x : f(x)^{2L+1} = b\},$$

where $f(x)^{2L+1}$ is shorthand for $f(x_i)$, $i = -L,\ldots,L$, that is, the
channel $(2L+1)$-tuple produced by the source using encoder $f$. We therefore
have that

$$\begin{aligned}
\rho(\bar r g,\mu) &\le \sum_{b\in B^{2L+1}} \int_{x\in V_f(b)} d\mu(x)\,\rho(x_0,g(b))
   + \sum_{b\in B^{2L+1}} \int_{x\in V_{\bar r}(b) - V_f(b)} d\mu(x)\,\rho(x_0,g(b)) \\
 &= \rho(fg,\mu) + \sum_{b\in B^{2L+1}} \int_{x\in V_{\bar r}(b) - V_f(b)} d\mu(x)\,\rho(x_0,g(b)) \\
 &\le \rho(fg,\mu) + \sum_{b\in B^{2L+1}} \int_{x\in V_{\bar r}(b)\,\Delta\,V_f(b)} d\mu(x)\,\rho(x_0,g(b)).
\end{aligned}$$

By making $N$ large enough, however, we can make

$$\mu\big(V_{\bar r}(b)\,\Delta\,V_f(b)\big)$$

arbitrarily small simultaneously for all $b \in B^{2L+1}$ and hence force all
of the integrals above to be arbitrarily small by the continuity of
integration. With Lemma 11.7.1 and Theorem 11.7.1 this completes the proof of
the following theorem.

Theorem 11.7.2: Given an AMS source $\mu$ and an additive fidelity criterion
with a reference letter,

$$\delta_{SBC}(R,\mu) = \delta(R,\mu),$$

that is, the class of sliding block codes is capable of exactly the same performance 
as the class of block codes. 

The sliding block source coding theorem immediately yields an alternative 
coding theorem for a code structure known as trellis encoding source codes 
wherein the sliding block decoder is kept but the encoder is replaced by a tree 
or trellis search algorithm such as the Viterbi algorithm [41]. The details of 
inferring the trellis encoding source coding theorem from the sliding-block source 
coding theorem can be found in [52] . 
11.8 A Geometric Interpretation of OPTA’s

We close this chapter on source coding theorems with a geometric interpretation
of the OPTA functions in terms of the $\bar\rho$ distortion between sources.
Suppose that $\mu$ is a stationary and ergodic source and that $\{\rho_n\}$ is
an additive fidelity criterion with a reference letter. Suppose that we have a
nearly optimal sliding block encoder and decoder for $\mu$ and a channel with
rate $R$, that is, the overall process is $\{X_n,\hat X_n\}$ and

$$E\rho(X_0,\hat X_0) \le \delta(R,\mu) + \delta.$$

If the overall hookup (source/encoder/channel/decoder) yields a distribution
$p$ on $\{X_n,\hat X_n\}$ and distribution $\eta$ on the reproduction process
$\{\hat X_n\}$, then clearly

$$\bar\rho(\mu,\eta) \le \delta(R,\mu) + \delta.$$

Furthermore, since the channel alphabet is $B$ the channel process must have
entropy rate less than $R = \log\|B\|$ and hence the reproduction process must
also have entropy rate less than $R$ from Corollary 4.2.5. Since $\delta$ is
arbitrary,

$$\delta(R,\mu) \ge \inf_{\eta:\,\bar H(\eta)\le R} \bar\rho(\mu,\eta).$$

Suppose next that $p$, $\mu$ and $\eta$ are stationary and ergodic and that
$\bar H(\eta) \le R$. Choose a stationary $p$ having $\mu$ and $\eta$ as
coordinate processes such that

$$E_p\rho(X_0,Y_0) \le \bar\rho(\mu,\eta) + \delta.$$

We have easily that $\bar I(X;Y) \le \bar H(\eta) \le R$ and hence the left
hand side is bounded below by the process distortion rate function
$\bar D_s(R,\mu)$. From Theorem 10.6.1 and the block source coding theorem,
however, this is just the OPTA function. We have therefore proved the
following:

Theorem 11.8.1: Let $\mu$ be a stationary and ergodic source and let
$\{\rho_n\}$ be an additive fidelity criterion with a reference letter. Then

$$\delta(R,\mu) = \inf_{\eta:\,\bar H(\eta)\le R} \bar\rho(\mu,\eta),$$

that is, the OPTA function (and hence the distortion-rate function) of a
stationary ergodic source is just the “distance” in the $\bar\rho$ sense to the
nearest stationary and ergodic process with the specified reproduction alphabet
and with entropy rate less than $R$.

This result originated in [55]. 
Chapter 12



Coding for noisy channels 



12.1 Noisy Channels 

In the treatment of source coding the communication channel was assumed to 
be noiseless. If the channel is noisy, then the coding strategy must be different. 
Now some form of error control is required to undo the damage caused by the 
channel. The overall communication problem is usually broken into two pieces: 
A source coder is designed for a noiseless channel with a given resolution or rate 
and an error correction code is designed for the actual noisy channel in order 
to make it appear almost noiseless. The combination of the two codes then 
provides the desired overall code or joint source and channel code. This division 
is natural in the sense that optimizing a code for a particular source may suggest 
quite different structure than optimizing it for a channel. The structures must 
be compatible at some point, however, so that they can be used together. 

This division of source and channel coding is apparent in the subdivision of 
this chapter. We shall begin with a basic lemma due to Feinstein [38] which is at 
the basis of traditional proofs of coding theorems for channels. It does not con- 
sider a source at all, but finds for a given conditional distribution the maximum 
number of inputs which lead to outputs which can be distinguished with high 
probability. Feinstein’s lemma can be thought of as a channel coding theorem 
for a channel which is used only once and which has no past or future. The 
lemma immediately provides a coding theorem for the special case of a channel 
which has no input memory or anticipation. The difficulties enter when the con- 
ditional distributions of output blocks given input blocks depend on previous or 
future inputs. This difficulty is handled by imposing some form of continuity 
on the channel with respect to its input, that is, by assuming that if the chan- 
nel input is known for a big enough block, then the conditional probability of 
outputs during the same block is known nearly exactly regardless of previous 
or future inputs. The continuity condition which we shall consider is that of 
$\bar d$-continuous channels. Joint source and channel codes have been obtained
for more general channels called weakly continuous channels (see, e.g., Kieffer
[80], [81]), but these results require a variety of techniques not yet
considered here and do not follow as a direct descendant of Feinstein’s lemma.

Block codes are extended to sliding-block codes in a manner similar to that 
for source codes: First it is shown that asynchronous block codes can be syn- 
chronized and then that the block codes can be “stationarized” by the insertion 
of random punctuation. The approach to synchronizing channel codes is based 
on a technique of Dobrushin [33] . 

We consider stationary channels almost exclusively, thereby not including 
interesting nonstationary channels such as finite state channels with an arbi- 
trary starting state. We will discuss such generalizations and we point out that 
they are straightforward for two-sided processes, but the general theory of AMS 
channels for one-sided processes is not in a satisfactory state. Lastly, we empha- 
size ergodic channels. In fact, for the sliding block codes the channels are also 
required to be totally ergodic, that is, ergodic with respect to all block shifts. 

As previously discussed, we emphasize digital, i.e., discrete, channels. A 
few of the results, however, are as easily proved under somewhat more general 
conditions and hence we shall do so. For example, given the background of this 
book it is actually easier to write things in terms of measures and integrals than 
in terms of sums over probability mass functions. This additional generality 
will also permit at least a description of how the results extend to continuous 
alphabet channels. 



12.2 Feinstein’s Lemma 



Let $(A,\mathcal{B}_A)$ and $(B,\mathcal{B}_B)$ be measurable spaces called the
input space and the output space, respectively. Let $P_X$ denote a probability
distribution on $(A,\mathcal{B}_A)$ and let $\nu(F|x)$, $F \in \mathcal{B}_B$,
$x \in A$, denote a regular conditional probability distribution on the output
space. $\nu$ can be thought of as a “channel” with random variables as input
and output instead of sequences. Define the hookup $P_X\nu = P_{XY}$ by

$$P_{XY}(F) = \int dP_X(x)\,\nu(F_x|x).$$

Let $P_Y$ denote the induced output distribution and let $P_X \times P_Y$
denote the resulting product distribution. Assume that
$P_{XY} \ll (P_X \times P_Y)$ and define the Radon-Nikodym derivative

$$f = \frac{dP_{XY}}{d(P_X \times P_Y)} \qquad (12.1)$$

and the information density

$$i(x,y) = \ln f(x,y).$$

We use abbreviated notation for densities when the meanings should be clear
from context, e.g., $f$ instead of $f_{XY}$. Observe that for any set $F$

$$\int_F dP_X(x)\Big(\int dP_Y(y)\,f(x,y)\Big)
  = \int_{F\times B} d(P_X\times P_Y)(x,y)\,f(x,y)
  = \int_{F\times B} dP_{XY}(x,y) = P_X(F) \le 1$$

and hence

$$\int dP_Y(y)\,f(x,y) \le 1; \quad P_X\text{-a.e.} \qquad (12.2)$$

Feinstein’s lemma shows that we can pick $M$ inputs
$\{x_i \in A;\ i = 1,2,\ldots,M\}$ and a corresponding collection of $M$
disjoint output events $\{\Gamma_i \in \mathcal{B}_B;\ i = 1,2,\ldots,M\}$,
with the property that given an input $x_i$ with high probability the output
will be in $\Gamma_i$. We call the collection
$\mathcal{C} = \{x_i,\Gamma_i;\ i = 1,2,\ldots,M\}$ a code with codewords $x_i$
and decoding regions $\Gamma_i$. We do not require that the $\Gamma_i$
exhaust $B$.

The generalization of Feinstein’s original proof for finite alphabets to general 
measurable spaces is due to Kadota [70] and the following proof is based on his. 

Lemma 12.2.1 (Feinstein’s Lemma): Given an integer $M$ and $a > 0$ there
exist $x_i \in A$; $i = 1,\ldots,M$ and a measurable partition
$\mathcal{F} = \{\Gamma_i;\ i = 1,\ldots,M\}$ of $B$ such that

$$\nu(\Gamma_i^c|x_i) \le Me^{-a} + P_{XY}(i \le a).$$

Proof: Define $G = \{x,y : i(x,y) > a\}$. Set
$\epsilon = Me^{-a} + P_{XY}(i \le a) = Me^{-a} + P_{XY}(G^c)$. The result is
obvious if $\epsilon \ge 1$ and hence we assume that $\epsilon < 1$ and hence
also that

$$P_{XY}(G^c) < \epsilon < 1$$

and therefore that

$$P_{XY}(i > a) = P_{XY}(G) = \int dP_X(x)\,\nu(G_x|x) > 1 - \epsilon > 0.$$

This implies that the set
$\tilde A = \{x : \nu(G_x|x) > 1 - \epsilon \text{ and (12.2) holds}\}$ must
have positive measure under $P_X$. We now construct a code consisting of input
points $x_i$ and output sets $\Gamma_{x_i}$. Choose an $x_1 \in \tilde A$ and
define $\Gamma_{x_1} = G_{x_1}$. Next choose if possible a point
$x_2 \in \tilde A$ for which $\nu(G_{x_2} - \Gamma_{x_1}|x_2) > 1 - \epsilon$.
Continue in this way until either $M$ points have been selected or all the
points in $\tilde A$ have been exhausted. In particular, given the pairs
$\{x_j,\Gamma_{x_j}\}$; $j = 1,2,\ldots,i-1$, satisfying the condition, find an
$x_i$ for which

$$\nu\Big(G_{x_i} - \bigcup_{j<i}\Gamma_{x_j}\,\Big|\,x_i\Big) > 1 - \epsilon \qquad (12.3)$$

and set $\Gamma_{x_i} = G_{x_i} - \bigcup_{j<i}\Gamma_{x_j}$. If the procedure
terminates before $M$ points have been collected, denote the final point’s
index by $n$. Observe that by (12.3)

$$\nu(\Gamma_{x_i}^c|x_i) = 1 - \nu(\Gamma_{x_i}|x_i) < \epsilon; \quad i = 1,2,\ldots,n$$

and hence the lemma will be proved if we can show that necessarily $n$ cannot
be strictly less than $M$. We do this by assuming the contrary and finding a
contradiction.
Suppose that the selection has terminated at $n < M$ and define the set
$F = \bigcup_{i=1}^n \Gamma_{x_i} \in \mathcal{B}_B$. Consider the probability

$$P_{XY}(G) = P_{XY}(G\cap(A\times F)) + P_{XY}(G\cap(A\times F^c)). \qquad (12.4)$$

The first term can be bounded above as

$$P_{XY}(G\cap(A\times F)) \le P_{XY}(A\times F) = P_Y(F) = \sum_{i=1}^n P_Y(\Gamma_{x_i}).$$

We also have from the definitions and from (12.2) that

$$P_Y(\Gamma_{x_i}) = \int_{\Gamma_{x_i}} dP_Y(y) \le \int_{G_{x_i}} dP_Y(y)
  \le \int_{G_{x_i}} \frac{f(x_i,y)}{e^a}\,dP_Y(y)
  \le e^{-a}\int dP_Y(y)\,f(x_i,y) \le e^{-a}$$

and hence

$$P_{XY}(G\cap(A\times F)) \le ne^{-a}. \qquad (12.5)$$

Consider the second term of (12.4):

$$P_{XY}(G\cap(A\times F^c)) = \int dP_X(x)\,\nu\big((G\cap(A\times F^c))_x\,\big|\,x\big)
  = \int dP_X(x)\,\nu(G_x\cap F^c|x)
  = \int dP_X(x)\,\nu\Big(G_x - \bigcup_{i=1}^n\Gamma_{x_i}\,\Big|\,x\Big). \qquad (12.6)$$

We must have, however, that

$$\nu\Big(G_x - \bigcup_{i=1}^n\Gamma_{x_i}\,\Big|\,x\Big) \le 1 - \epsilon$$

with $P_X$ probability 1 or there would be a point $x_{n+1}$ for which

$$\nu\Big(G_{x_{n+1}} - \bigcup_{i=1}^{n}\Gamma_{x_i}\,\Big|\,x_{n+1}\Big) > 1 - \epsilon,$$

that is, (12.3) would hold for $i = n+1$, contradicting the definition of $n$
as the largest integer for which (12.3) holds. Applying this observation to
(12.6) yields

$$P_{XY}(G\cap(A\times F^c)) \le 1 - \epsilon$$

which with (12.4) and (12.5) implies that

$$P_{XY}(G) \le ne^{-a} + 1 - \epsilon. \qquad (12.7)$$

From the definition of $\epsilon$, however, we have also that

$$P_{XY}(G) = 1 - P_{XY}(G^c) = 1 - \epsilon + Me^{-a}$$

which with (12.7) implies that $M \le n$, completing the proof. □
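The greedy selection in the proof can be traced on a finite example. The sketch below is an illustration only, under assumptions that are not part of the text: both alphabets are finite, the channel is a binary symmetric channel with crossover probability 0.05 used on blocks of 8 bits, and the input distribution is uniform. The names `feinstein_greedy` and `bsc_block` are hypothetical helpers introduced for this example.

```python
import math
from itertools import product


def feinstein_greedy(px, nu, M, a):
    """Greedy selection from the proof of Feinstein's lemma (finite alphabets).

    px[x] is P_X(x) and nu[x][y] is nu(y | x). Returns (code, eps), where code
    is a list of (x_i, Gamma_i) pairs and eps = M e^{-a} + P_XY(i <= a).
    The construction is only meaningful when eps < 1.
    """
    nx, ny = len(px), len(nu[0])
    py = [sum(px[x] * nu[x][y] for x in range(nx)) for y in range(ny)]

    # Sections G_x = {y : i(x, y) > a}, with i(x, y) = ln(nu(y|x) / P_Y(y)).
    # Inputs of zero probability form a P_X-null set and get an empty section.
    G = [{y for y in range(ny)
          if px[x] > 0 and nu[x][y] > 0 and math.log(nu[x][y] / py[y]) > a}
         for x in range(nx)]

    # P_XY(G^c): joint probability of the pairs whose density is at most a.
    p_low = sum(px[x] * nu[x][y]
                for x in range(nx) for y in range(ny) if y not in G[x])
    eps = M * math.exp(-a) + p_low

    code, used = [], set()
    for x in range(nx):
        if len(code) == M:
            break
        if px[x] == 0:
            continue
        region = G[x] - used                         # G_x minus earlier regions
        if sum(nu[x][y] for y in region) > 1 - eps:  # condition (12.3)
            code.append((x, region))
            used |= region
    return code, eps


def bsc_block(n, p):
    """Transition matrix of n independent uses of a BSC(p) (assumed example)."""
    words = list(product((0, 1), repeat=n))

    def prob(x, y):
        d = sum(s != t for s, t in zip(x, y))
        return p ** d * (1 - p) ** (n - d)

    return [[prob(x, y) for y in words] for x in words]


nu = bsc_block(8, 0.05)
px = [1.0 / len(nu)] * len(nu)
code, eps = feinstein_greedy(px, nu, M=4, a=2.0)
errors = [1 - sum(nu[x][y] for y in region) for x, region in code]
```

A single pass over the inputs suffices because the set of used outputs only grows, so an input rejected once can never qualify later. In this example the sections $G_x$ are Hamming balls of radius 1, and the conditional error probabilities in `errors` stay below `eps`, as the lemma requires.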
12.3 Feinstein’s Theorem

Given a channel $[A,\nu,B]$ an $(M,n,\epsilon)$ block channel code for $\nu$ is
a collection $\{w_i,\Gamma_i\}$; $i = 1,2,\ldots,M$, where $w_i \in A^n$,
$\Gamma_i \in \mathcal{B}_B^n$, all $i$, with the property that

$$\max_{1\le i\le M}\ \sup_{x\in c(w_i)} \nu_x^n(\Gamma_i^c) \le \epsilon, \qquad (12.8)$$

where $c(a^n) = \{x : x^n = a^n\}$ and where $\nu_x^n$ is the restriction of
$\nu_x$ to $\mathcal{B}_B^n$. The rate of the code is defined as
$n^{-1}\log M$. Thus an $(M,n,\epsilon)$ channel code
is a collection of M input n-tuples and corresponding output cells such that 
regardless of the past or future inputs, if the input during time 1 to n is a 
channel codeword, then the output during time 1 to n is very likely to lie in 
the corresponding output cell. Channel codes will be useful in a communication 
system because they permit nearly error free communication of a select group 
of messages or codewords. A communication system can then be constructed 
for communicating a source over the channel reliably by mapping source blocks 
into channel codewords. If there are enough channel codewords to assign to all 
of the source blocks (at least the most probable ones), then that source can 
be reliably reproduced by the receiver. Hence a fundamental issue for such an 
application will be the number of messages M or, equivalently, the rate R of a 
channel code. 

Feinstein’s lemma can be applied fairly easily to obtain something that
resembles a coding theorem for a noisy channel. Suppose that $[A,\nu,B]$ is a
channel and $[A,\mu]$ is a source and that $[A\times B, p = \mu\nu]$ is the
resulting hookup. Denote the resulting pair process by $\{X_n,Y_n\}$. For any
integer $K$ let $p^K$ denote the restriction of $p$ to
$(A^K\times B^K, \mathcal{B}_A^K\times\mathcal{B}_B^K)$, that is, the
distribution on input/output $K$-tuples $(X^K,Y^K)$. The joint distribution
$p^K$ together with the input distribution $\mu^K$ induce a regular conditional
probability $\hat\nu^K$ defined by
$\hat\nu^K(F|x^K) = \Pr(Y^K \in F|X^K = x^K)$. In particular,

$$\hat\nu^K(G|a^K) = \Pr(Y^K \in G|X^K = a^K)
  = \frac{1}{\mu^K(a^K)}\int_{c(a^K)} \nu_x^K(G)\,d\mu(x), \qquad (12.9)$$

where $c(a^K) = \{x : x^K = a^K\}$ is the rectangle of all sequences with a
common $K$-dimensional output. We call $\hat\nu^K$ the induced $K$-dimensional
channel of the channel $\nu$ and the source $\mu$. It is important to note that
the induced channel depends on the source as well as on the channel, a fact
that will cause some difficulty in applying Feinstein’s lemma. An exception to
this case which proves to be an easy application is that of a channel without
input memory and anticipation, in which case we have from the definitions that

$$\hat\nu^K(F|a^K) = \nu_x(Y^K \in F); \quad x \in c(a^K).$$

Application of Feinstein’s lemma to the induced channel yields the following
result, which was proved by Feinstein for stationary finite alphabet channels
and is known as Feinstein’s theorem:
Lemma 12.3.1: Suppose that $[A\times B,\mu\nu]$ is an AMS and ergodic hookup
of a source $\mu$ and channel $\nu$. Let
$\bar I_{\mu\nu} = \bar I_{\mu\nu}(X;Y)$ denote the average mutual information
rate and assume that $\bar I_{\mu\nu} = I^*_{\mu\nu}$ is finite (as is the case
if the alphabets are finite (Theorem 6.4.1) or have the finite-gap information
property (Theorem 6.4.3)). Then for any $R < \bar I_{\mu\nu}$ and any
$\epsilon > 0$ there exists for sufficiently large $n$ a code
$\{w_i^n,\Gamma_i;\ i = 1,2,\ldots,M\}$, where $M = \lfloor e^{nR}\rfloor$,
$w_i^n \in A^n$, and $\Gamma_i \in \mathcal{B}_B^n$, with the property that

$$\hat\nu^n(\Gamma_i^c|w_i^n) \le \epsilon, \quad i = 1,2,\ldots,M. \qquad (12.10)$$

Comment: We shall call a code $\{w_i,\Gamma_i;\ i = 1,2,\ldots,M\}$ which
satisfies (12.10) for a channel input process $\mu$ a
$(\mu,M,n,\epsilon)$-Feinstein code. The quantity $n^{-1}\log M$ is called the
rate of the Feinstein code.

Proof: Let $\eta$ denote the output distribution induced by $\mu$ and $\nu$.
Define the information density

$$i_n = \ln\frac{dp^n}{d(\mu^n\times\eta^n)}$$

and define

$$\delta = \frac{\bar I_{\mu\nu} - R}{2} > 0.$$

Apply Feinstein’s lemma to the $n$-dimensional hookup $(\mu\nu)^n$ with
$M = \lfloor e^{nR}\rfloor$ and $a = n(R+\delta)$ to obtain a code
$\{w_i,\Gamma_i\}$; $i = 1,2,\ldots,M$ with

$$\max_i \hat\nu^n(\Gamma_i^c|w_i^n) \le Me^{-n(R+\delta)} + p^n(i_n \le n(R+\delta))
  = \lfloor e^{nR}\rfloor e^{-n(R+\delta)} + p\Big(\frac{1}{n}i_n(X^n;Y^n) \le R+\delta\Big) \qquad (12.11)$$

and hence

$$\max_i \hat\nu^n(\Gamma_i^c|w_i^n) \le e^{-n\delta}
  + p\Big(\frac{1}{n}i_n(X^n;Y^n) \le \bar I_{\mu\nu} - \delta\Big). \qquad (12.12)$$

From Theorem 6.3.1 $n^{-1}i_n$ converges in $L^1$ to $\bar I_{\mu\nu}$ and
hence it also converges in probability. Thus given $\epsilon$ we can choose an
$n$ large enough to ensure that the right hand side of (12.11) is smaller than
$\epsilon$, which completes the proof of the theorem. □
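To see how the right hand side of (12.12) behaves, it can be evaluated exactly in a simple case. The sketch below assumes a binary symmetric channel with crossover probability $p$ driven by i.i.d. uniform bits, which is an assumption made only for this illustration. In that case the information density depends on the output only through the number $k$ of flipped symbols, $i_n = n\ln 2 + k\ln p + (n-k)\ln(1-p)$, and $k$ is binomial. The function name `feinstein_bound` is hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def feinstein_bound(n, p, R):
    """Right hand side of (12.12) for a BSC(p) with i.i.d. uniform inputs.

    Requires R below the information rate I = ln 2 - h(p) (in nats).
    """
    info_rate = math.log(2) - binary_entropy(p)
    delta = (info_rate - R) / 2
    tail = 0.0
    for k in range(n + 1):  # k = number of symbols flipped by the channel
        density = (n * math.log(2) + k * math.log(p)
                   + (n - k) * math.log(1 - p)) / n
        if density <= info_rate - delta:
            # Binomial(n, p) probability of k flips, computed in the log domain
            log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                       - math.lgamma(n - k + 1)
                       + k * math.log(p) + (n - k) * math.log(1 - p))
            tail += math.exp(log_pmf)
    return math.exp(-n * delta) + tail


bounds = {n: feinstein_bound(n, p=0.1, R=0.3) for n in (200, 2000, 20000)}
```

For these parameters the bound shrinks as $n$ grows, which is the convergence in probability used in the last step of the proof.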

We said that the lemma “resembled” a coding theorem because a real coding 
theorem would prove the existence of an $(M,n,\epsilon)$ channel code, that is,
it would concern the channel $\nu$ itself and not the induced channel
$\hat\nu$, which depends on a channel input process distribution $\mu$. The
difference between a Feinstein code
and a channel code is that the Feinstein code has a similar property for an 
induced channel which in general depends on a source distribution, while the 
channel code has this property independent of any source distribution and for 
any past or future inputs. 

Feinstein codes will be used to construct block codes for noisy channels. The 
simplest such construction is presented next. 
Corollary 12.3.1: Suppose that a channel $[A,\nu,B]$ is input memoryless
and input nonanticipatory (see Section 9.4). Then a
$(\mu,M,n,\epsilon)$-Feinstein code for some channel input process $\mu$ is
also an $(M,n,\epsilon)$-code.

Proof: Immediate since for a channel without input memory and anticipation
we have that $\nu_x^n(F) = \nu_u^n(F)$ if $x^n = u^n$. □

The principal idea of constructing channel codes from Feinstein codes for
more general channels will be to place assumptions on the channel which ensure
that for sufficiently large $n$ the channel distribution $\nu_x^n$ and the
induced finite dimensional channel $\hat\nu^n(\cdot|x^n)$ are close. This
general idea was proposed by McMillan [103] who suggested that coding theorems
would follow for channels that were sufficiently continuous in a suitable
sense.

The previous results did not require stationarity of the channel, but in a
sense stationarity is implicit if the channel codes are to be used repeatedly
(as they will be in a communication system). Thus the immediate applications of
the Feinstein results will be to stationary channels.

The following is a rephrasing of Feinstein’s theorem that will be useful. 

Corollary 12.3.2: Suppose that $[A\times B,\mu\nu]$ is an AMS and ergodic
hookup of a source $\mu$ and channel $\nu$. Let
$\bar I_{\mu\nu} = \bar I_{\mu\nu}(X;Y)$ denote the average mutual information
rate and assume that $\bar I_{\mu\nu} = I^*_{\mu\nu}$ is finite. Then for any
$R < \bar I_{\mu\nu}$ and any $\epsilon > 0$ there exists an $n_0$ such that
for all $n \ge n_0$ there are
$(\mu,\lfloor e^{nR}\rfloor,n,\epsilon)$-Feinstein codes.

As a final result of the Feinstein variety, we point out a variation that
applies to nonergodic channels.

Corollary 12.3.3: Suppose that $[A\times B,\mu\nu]$ is an AMS hookup of a
source $\mu$ and channel $\nu$. Suppose also that the information density
converges a.e. to a limiting density

$$i_\infty = \lim_{n\to\infty}\frac{1}{n}i_n(X^n;Y^n).$$

(Conditions for this to hold are given in Theorem 8.5.1.) Then given
$\epsilon > 0$ and $\delta > 0$ there exists for sufficiently large $n$ a
$[\mu,M,n,\epsilon + \mu\nu(i_\infty \le R+\delta)]$ Feinstein code with
$M = \lfloor e^{nR}\rfloor$.

Proof: Follows from the lemma and from Fatou’s lemma which implies that

$$\limsup_{n\to\infty} p\Big(\frac{1}{n}i_n(X^n;Y^n) \le a\Big) \le p(i_\infty \le a). \qquad □$$

12.4 Channel Capacity 

The form of the Feinstein lemma and its corollaries invites the question of how 
large R (and hence M ) can be made while still getting a code of the desired 
form. From Feinstein’s theorem it is seen that for an ergodic channel $R$ can
be any number less than $\bar I_{\mu\nu}$, which suggests that if we define the
quantity

$$C_{AMS,e} = \sup_{\text{AMS and ergodic }\mu} \bar I_{\mu\nu}, \qquad (12.13)$$

then if $\bar I_{\mu\nu} = I^*_{\mu\nu}$ (e.g., the channel has finite
alphabet), we can construct for some $\mu$ a Feinstein code for $\mu$ with rate
$R$ arbitrarily near $C_{AMS,e}$. $C_{AMS,e}$ is
an example of a quantity called an information rate capacity or, simply, capacity 
of a channel. We shall encounter a few variations on this definition just as there 
were various ways of defining distortion-rate functions for sources by considering 
either vectors or processes with different constraints. In this section a few of 
these definitions are introduced and compared. 

A few possible definitions of information rate capacity are

$$C_{AMS} = \sup_{\text{AMS }\mu} \bar I_{\mu\nu}, \qquad (12.14)$$

$$C_s = \sup_{\text{stationary }\mu} \bar I_{\mu\nu}, \qquad (12.15)$$

$$C_{s,e} = \sup_{\text{stationary and ergodic }\mu} \bar I_{\mu\nu}, \qquad (12.16)$$

$$C_{ns} = \sup_{n\text{-stationary }\mu} \bar I_{\mu\nu}, \qquad (12.17)$$

$$C_{bs} = \sup_{\text{block stationary }\mu} \bar I_{\mu\nu}
        = \sup_n\ \sup_{n\text{-stationary }\mu} \bar I_{\mu\nu}. \qquad (12.18)$$

Several inequalities are obvious from the definitions:

$$C_{AMS} \ge C_{bs} \ge C_{ns} \ge C_s \ge C_{s,e} \qquad (12.19)$$

$$C_{AMS} \ge C_{AMS,e} \ge C_{s,e}. \qquad (12.20)$$

In order to relate these definitions we need a variation on Lemma 12.3.1
described in the following lemma.

Lemma 12.4.1: Given a stationary finite-alphabet channel $[A,\nu,B]$, let
$\mu$ be the distribution of a stationary channel input process and let
$\{\mu_x\}$ be its ergodic decomposition. Then

$$\bar I_{\mu\nu} = \int d\mu(x)\,\bar I_{\mu_x\nu}. \qquad (12.21)$$


Proof: We can write

$$\bar I_{\mu\nu} = h_1(\mu) - h_2(\mu)$$

where

$$h_1(\mu) = \bar H_\eta(Y) = \inf_n \frac{1}{n}H_\eta(Y^n)$$

is the entropy rate of the output, where $\eta$ is the output measure induced
by $\mu$ and $\nu$, and where

$$h_2(\mu) = \bar H_{\mu\nu}(Y|X) = \lim_{n\to\infty}\frac{1}{n}H_{\mu\nu}(Y^n|X^n)$$

is the conditional entropy rate of the output given the input. If
$\mu_k \to \mu$ on any finite dimensional rectangle, then also
$\eta_k \to \eta$ and hence

$$H_{\eta_k}(Y^n) \to H_\eta(Y^n)$$

and hence it follows as in the proof of Corollary 2.4.1 that $h_1(\mu)$ is an
upper semicontinuous function of $\mu$. It is also affine because
$\bar H_\eta(Y)$ is an affine function of $\eta$ (Lemma 2.4.2) which is in turn
a linear function of $\mu$. Thus from Theorem 8.9.1 of [50]

$$h_1(\mu) = \int d\mu(x)\,h_1(\mu_x).$$

$h_2(\mu)$ is also affine in $\mu$ since $h_1(\mu)$ is affine in $\mu$ and
$\bar I_{\mu\nu}$ is affine in $\mu$ (since it is affine in $\mu\nu$ from
Lemma 6.2.2). Hence we will be done if we can show that $h_2(\mu)$ is upper
semicontinuous in $\mu$ since then Theorem 8.9.1 of [50] will imply that

$$h_2(\mu) = \int d\mu(x)\,h_2(\mu_x)$$

which with the corresponding result for $h_1$ proves the lemma. To see this
observe that if $\mu_k \to \mu$ on finite dimensional rectangles, then

$$H_{\mu_k\nu}(Y^n|X^n) \to H_{\mu\nu}(Y^n|X^n). \qquad (12.22)$$

Next observe that for stationary processes

$$\begin{aligned}
H(Y^n|X^n) &\le H(Y^m|X^n) + H(Y_m^{n-m}|X^n) \\
 &\le H(Y^m|X^m) + H(Y_m^{n-m}|X_m^{n-m}) = H(Y^m|X^m) + H(Y^{n-m}|X^{n-m})
\end{aligned}$$

which as in Section 2.4 implies that $H(Y^n|X^n)$ is a subadditive sequence and
hence

$$\lim_{n\to\infty}\frac{1}{n}H(Y^n|X^n) = \inf_n \frac{1}{n}H(Y^n|X^n).$$

Coupling this with (12.22) proves upper semicontinuity exactly as in the proof
of Corollary 2.4.1, which completes the proof of the lemma. □

Lemma 12.4.2: If a channel $\nu$ has a finite alphabet and is stationary, then
all of the above information rate capacities are equal.

Proof: From Theorem 6.4.1 $\bar I = I^*$ for finite alphabet processes and
hence from Lemma 6.6.2 and Lemma 9.3.2 we have that if $\mu$ is AMS with
stationary mean $\bar\mu$, then

$$\bar I_{\mu\nu} = \bar I_{\overline{\mu\nu}} = \bar I_{\bar\mu\nu}$$

and thus the supremum over AMS sources must be the same as that over
stationary sources. The fact that $C_s \le C_{s,e}$ follows immediately from
the previous lemma since the best stationary source can do no better than to
put all of its measure on the ergodic component yielding the maximum
information rate. Combining these facts with (12.19)-(12.20) proves the
lemma. □

Because of the equivalence of the various forms of information rate capacity 
for stationary channels, we shall use the symbol C to represent the information 
rate capacity of a stationary channel and observe that it can be considered as 
the solution to any of the above maximization problems. 

Shannon’s original definition of channel capacity applied to channels without 
input memory or anticipation. We pause to relate this definition to the process 
definitions. Suppose that a channel $[A,\nu,B]$ has no input memory or
anticipation and hence for each $n$ there are regular conditional probability
measures $\hat\nu^n(G|x^n)$; $x^n \in A^n$, $G \in \mathcal{B}_B^n$, such that

$$\nu_x^n(G) = \hat\nu^n(G|x^n).$$

Define the finite-dimensional capacity of the $\hat\nu^n$ by

$$C_n(\hat\nu^n) = \sup_{\mu^n} I_{\mu^n\hat\nu^n}(X^n;Y^n),$$

where the supremum is over all vector distributions $\mu^n$ on $A^n$. Define
the Shannon capacity of the channel $\nu$ by

$$C_{Shannon} = \lim_{n\to\infty}\frac{1}{n}C_n(\hat\nu^n)$$

if the limit exists. Suppose that the Shannon capacity exists for a channel
$\nu$ without memory or anticipation. Choose $N$ large enough so that
$N^{-1}C_N$ is very close to $C_{Shannon}$ and let $\mu^N$ approximately yield
$C_N$. Then construct a block memoryless source using $\mu^N$. A block
memoryless source is AMS and hence if the channel is AMS we must have an
information rate

$$\bar I_{\mu\nu}(X;Y) = \lim_{n\to\infty}\frac{1}{n}I_{\mu\nu}(X^n;Y^n)
  = \lim_{k\to\infty}\frac{1}{kN}I_{\mu\nu}(X^{kN};Y^{kN}).$$

Since the input process is block memoryless, we have from Lemma 9.4.2 that

$$I(X^{kN};Y^{kN}) \ge \sum_{i=0}^{k-1} I(X_{iN}^N;Y_{iN}^N).$$

If the channel is stationary then $\{X_n,Y_n\}$ is $N$-stationary and hence if

$$\frac{1}{N}I_{\mu^N\hat\nu^N}(X^N;Y^N) \ge C_{Shannon} - \epsilon,$$

then

$$\frac{1}{kN}I(X^{kN};Y^{kN}) \ge C_{Shannon} - \epsilon.$$

Taking the limit as $k \to \infty$ we have that

$$C_{AMS} = C \ge \bar I(X;Y) = \lim_{k\to\infty}\frac{1}{kN}I(X^{kN};Y^{kN}) \ge C_{Shannon} - \epsilon$$

and hence

$$C \ge C_{Shannon}.$$



Conversely, pick a stationary source $\mu$ which nearly yields $C = C_s$, that
is,

$$\bar I_{\mu\nu} \ge C_s - \epsilon.$$

Choose $n_0$ sufficiently large to ensure that for $n \ge n_0$

$$\frac{1}{n}I_{\mu\nu}(X^n;Y^n) \ge \bar I_{\mu\nu} - \epsilon \ge C_s - 2\epsilon.$$

This implies, however, that for $n \ge n_0$

$$\frac{1}{n}C_n \ge C_s - 2\epsilon,$$

and hence application of the previous lemma proves the following lemma.

Lemma 12.4.3: Given a finite alphabet stationary channel $\nu$ with no input
memory or anticipation,

$$C = C_{AMS} = C_s = C_{s,e} = C_{Shannon}.$$

The Shannon capacity is of interest because it can be numerically computed 
while the process definitions are not always amenable to such computation. 
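As one illustration of such a numerical computation, the sketch below uses the Blahut-Arimoto iteration, a standard algorithm that is not part of this text, to compute the single-letter capacity $\sup_{\mu^1} I(X;Y)$ of a finite-alphabet channel. It assumes a memoryless channel given by a transition matrix `W[x][y]`, for which the finite-dimensional capacities satisfy $C_n = nC_1$, so the single-letter value equals the Shannon capacity. The function name and tolerances are choices made for this example.

```python
import math


def blahut_arimoto(W, tol=1e-10, max_iter=10_000):
    """Capacity (in nats) of a discrete memoryless channel W[x][y] = W(y|x).

    Returns (capacity_estimate, input_distribution). The estimate is the
    lower bound of the iteration, which is within tol of the upper bound
    whenever the loop stops before max_iter.
    """
    nx, ny = len(W), len(W[0])
    p = [1.0 / nx] * nx          # start from the uniform input distribution
    lower = 0.0
    for _ in range(max_iter):
        # Output distribution induced by the current input distribution
        q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
        # c[x] = exp of the relative entropy between W(.|x) and q
        c = [math.exp(sum(W[x][y] * math.log(W[x][y] / q[y])
                          for y in range(ny) if W[x][y] > 0))
             for x in range(nx)]
        z = sum(p[x] * c[x] for x in range(nx))
        lower, upper = math.log(z), math.log(max(c))
        p = [p[x] * c[x] / z for x in range(nx)]
        if upper - lower < tol:
            break
    return lower, p


# Binary symmetric channel with crossover probability 0.1
capacity_bsc, input_bsc = blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])
# Z-channel: input 0 is noiseless, input 1 is flipped with probability 1/2
capacity_z, input_z = blahut_arimoto([[1.0, 0.0], [0.5, 0.5]])
```

For the binary symmetric channel the result agrees with the closed form $\ln 2 - h_2(0.1)$, and for the Z-channel with $\ln(5/4)$, both in nats.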

With Corollary 12.3.2 and the definition of channel capacity we have the 
following result. 

Lemma 12.4.4: If $\nu$ is an AMS and ergodic channel and $R < C$, then there
is an $n_0$ sufficiently large to ensure that for all $n \ge n_0$ there exist
$(\mu,\lfloor e^{nR}\rfloor,n,\epsilon)$ Feinstein codes for some channel input
process $\mu$.

Corollary 12.4.1: Suppose that $[A,\nu,B]$ is an AMS and ergodic channel
with no input memory or anticipation. Then if $R < C$, the information rate
capacity or Shannon capacity, then for $\epsilon > 0$ there exists for
sufficiently large $n$ a $(\lfloor e^{nR}\rfloor,n,\epsilon)$ channel code.

Proof: Follows immediately from Corollary 12.3.3 by choosing a stationary
and ergodic source $\mu$ with $\bar I_{\mu\nu} \in (R,C)$. □

There is another, quite different, notion of channel capacity that we
introduce for comparison and to aid the discussion of nonergodic stationary
channels. Define for an AMS channel $\nu$ and any $\lambda \in (0,1)$ the
quantile

$$C^*(\lambda) = \sup_{\text{AMS }\mu}\ \sup\{r : \mu\nu(i_\infty \le r) < \lambda\},$$

where the supremum is over all AMS channel input processes and $i_\infty$ is
the limiting information density (which exists because $\mu\nu$ is AMS and has
finite alphabet). Define the information quantile capacity $C^*$ by

$$C^* = \lim_{\lambda\to 0} C^*(\lambda).$$

The limit is well defined since the $C^*(\lambda)$ are bounded and do not
increase as $\lambda$ decreases. The information quantile capacity was
introduced by Winkelbauer [149] and its properties were developed by him and by
Kieffer [75]. Fix an $R < C^*$ and define $\delta = (C^* - R)/2$. Given
$\epsilon > 0$ we can find from the definition of $C^*$ an AMS channel input
process $\mu$ for which $\mu\nu(i_\infty \le R + \delta) \le \epsilon$.
Applying Corollary 12.3.3 with this $\delta$ and $\epsilon/2$ then yields the
following result for nonergodic channels.
Lemma 12.4.5: If $\nu$ is an AMS channel and $R < C^*$, then there is an
$n_0$ sufficiently large to ensure that for all $n \ge n_0$ there exist
$(\mu,\lfloor e^{nR}\rfloor,n,\epsilon)$ Feinstein codes for some channel input
process $\mu$.

We close this section by relating $C$ and $C^*$ for AMS channels.

Lemma 12.4.6: Given an AMS channel $\nu$,

$$C \ge C^*.$$

Proof: Fix $\lambda > 0$. If $r < C^*(\lambda)$ there is a $\mu$ such that
$\lambda > \mu\nu(i_\infty \le r) = 1 - \mu\nu(i_\infty > r) \ge 1 - \bar I_{\mu\nu}/r$,
where we have used the Markov inequality. Thus for all $r < C^*(\lambda)$ we
have that $\bar I_{\mu\nu} \ge r(1 - \mu\nu(i_\infty \le r))$ and hence

$$C \ge \bar I_{\mu\nu} \ge C^*(\lambda)(1-\lambda) \to C^* \quad\text{as } \lambda \to 0. \qquad □$$

It can be shown that if a stationary channel is also ergodic, then $C = C^*$
by using the ergodic decomposition to show that the supremum defining
$C^*(\lambda)$ can be taken over ergodic sources and then using the fact that
for ergodic $\mu$ and $\nu$, $i_\infty$ equals $\bar I_{\mu\nu}$ with
probability one. (See Kieffer [75].)

12.5 Robust Block Codes 

Feinstein codes immediately yield channel codes when the channel has no in- 
put memory or anticipation because the induced vector channel is the same 
with respect to vectors as the original channel. When extending this technique 
to channels with memory and anticipation we will try to ensure that the in- 
duced channels are still reasonable approximations to the original channel, but 
the approximations will not be exact and hence the conditional distributions 
considered in the Feinstein construction will not be the same as the channel 
conditional distributions. In other words, the Feinstein construction guarantees 
a code that works well for a conditional distribution formed by averaging the 
channel over its past and future using a channel input distribution that approx- 
imately yields channel capacity. This does not in general imply that the code 
will also work well when used on the unaveraged channel with a particular past 
and future input sequence. We solve this problem by considering channels for 
which the two distributions are close if the block length is long enough. 

In order to use the Feinstein construction for one distribution on an actual 
channel, we will modify the block codes slightly so as to make them robust in 
the sense that if they are used on channels with slightly different conditional 
distributions, their performance as measured by probability of error does not 
change much. In this section we prove that this can be done. The basic technique 
is due to Dobrushin [33] and a similar technique was studied by Ahlswede and 
Gacs [4]. (See also Ahlswede and Wolfowitz [5].) The results of this section are 
due to Gray, Ornstein, and Dobrushin [59]. 

A channel block length $n$ code $\{w_i, \Gamma_i;\ i = 1, 2, \cdots, M\}$ will be called
$\delta$-robust (in the Hamming distance sense) if the decoding sets $\Gamma_i$ are such
that the expanded sets

$$(\Gamma_i)_\delta = \{y^n : \frac{1}{n} d_n(y^n, \Gamma_i) \le \delta\}$$

are disjoint, where

$$d_n(y^n, \Gamma_i) = \min_{u^n \in \Gamma_i} d_n(y^n, u^n)$$

and

$$d_n(y^n, u^n) = \sum_{i=0}^{n-1} d_H(y_i, u_i)$$

and $d_H(a, b)$ is the Hamming distance (1 if $a \ne b$ and 0 if $a = b$). Thus the code
is $\delta$-robust if received $n$-tuples in a decoding set can be changed by an average
Hamming distance of up to $\delta$ without falling in a different decoding set. We
show that by reducing the rate of a code slightly we can always make a Feinstein
code robust.
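The definition of a robust code can be checked by brute force on a toy example. The following is a minimal sketch, not part of the original development: the function names (`hamming`, `expand`, `is_robust`) and the toy code are illustrative assumptions, and enumeration of all $n$-tuples is only feasible for very small $n$.

```python
from itertools import product

def hamming(a, b):
    """Number of coordinates in which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def expand(decoding_set, delta, alphabet, n):
    """All n-tuples whose average Hamming distance to the set is at most delta."""
    radius = delta * n
    return {y for y in product(alphabet, repeat=n)
            if min(hamming(y, u) for u in decoding_set) <= radius}

def is_robust(decoding_sets, delta, alphabet, n):
    """True if the delta-expanded decoding sets are pairwise disjoint."""
    expanded = [expand(g, delta, alphabet, n) for g in decoding_sets]
    return all(expanded[i].isdisjoint(expanded[j])
               for i in range(len(expanded))
               for j in range(i + 1, len(expanded)))

# Hypothetical toy code: two singleton decoding sets over a binary alphabet.
toy_sets = [{(0, 0, 0, 0, 0, 0)}, {(1, 1, 1, 1, 1, 1)}]
print(is_robust(toy_sets, 0.2, (0, 1), 6))  # True: radius-1 balls do not meet
print(is_robust(toy_sets, 0.5, (0, 1), 6))  # False: radius-3 balls share words
```

In the toy code the two decoding sets are at Hamming distance 6, so expansion by one symbol keeps them apart while expansion by three symbols does not.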

Lemma 12.5.1: Let $\{w'_i, \Gamma'_i;\ i = 1, 2, \cdots, M'\}$ be a $(\mu, e^{nR'}, n, \epsilon)$-Feinstein
code for a channel $\nu$. Given $\delta \in (0, 1/4)$ and

$$R < R' - h_2(2\delta) - 2\delta \log(\|B\| - 1),$$

where as before $h_2(a)$ is the binary entropy function $-a \log a - (1 - a) \log(1 - a)$,
there exists a $\delta$-robust $(\mu, \lfloor e^{nR} \rfloor, n, \epsilon_n)$-Feinstein code for $\nu$ with

$$\epsilon_n \le \epsilon + e^{-n(R' - R - h_2(2\delta) - 2\delta \log(\|B\| - 1) - 3/n)}.$$



Proof: For $i = 1, 2, \cdots, M'$ let $r_i(y^n)$ denote the indicator function for $(\Gamma'_i)_{2\delta}$.
For a fixed $y^n$ there can be at most

$$\sum_{i=0}^{2\delta n} \binom{n}{i} (\|B\| - 1)^i = \|B\|^n \sum_{i=0}^{2\delta n} \binom{n}{i} \left(1 - \frac{1}{\|B\|}\right)^i \left(\frac{1}{\|B\|}\right)^{n-i}$$

$n$-tuples $b^n \in B^n$ such that $n^{-1} d_n(y^n, b^n) \le 2\delta$. Set $p = 1 - 1/\|B\|$ and apply
Lemma 2.3.5 to the sum to obtain the bound

$$\|B\|^n \sum_{i=0}^{2\delta n} \binom{n}{i} \left(1 - \frac{1}{\|B\|}\right)^i \left(\frac{1}{\|B\|}\right)^{n-i} \le \|B\|^n e^{-n h_2(2\delta \| p)} = e^{-n h_2(2\delta \| p) + n \ln \|B\|},$$

where

$$h_2(2\delta \| p) = 2\delta \ln \frac{2\delta}{p} + (1 - 2\delta) \ln \frac{1 - 2\delta}{1 - p}$$

$$= -h_2(2\delta) + 2\delta \ln \frac{\|B\|}{\|B\| - 1} + (1 - 2\delta) \ln \|B\| = -h_2(2\delta) + \ln \|B\| - 2\delta \ln(\|B\| - 1).$$

Combining this bound with the fact that the $\Gamma'_i$ are disjoint we have that

$$\sum_{i=1}^{M'} r_i(y^n) \le \sum_{i=0}^{2\delta n} \binom{n}{i} (\|B\| - 1)^i \le e^{n(h_2(2\delta) + 2\delta \ln(\|B\| - 1))}.$$

Set $M = \lfloor e^{nR} \rfloor$ and select $2M$ subscripts $k_1, \cdots, k_{2M}$ from $\{1, \cdots, M'\}$ by
random equally likely independent selection without replacement so that each
index pair $(k_j, k_m)$; $j, m = 1, \cdots, 2M$; $j \ne m$, assumes any unequal pair with
probability $(M'(M' - 1))^{-1}$. We then have that



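$$E\left( \frac{1}{2M} \sum_{j=1}^{2M} \sum_{m=1, m \ne j}^{2M} \hat\nu(\Gamma'_{k_j} \cap (\Gamma'_{k_m})_{2\delta} \,|\, w'_{k_j}) \right)$$

$$= \frac{1}{2M} \sum_{j=1}^{2M} \sum_{m=1, m \ne j}^{2M} \sum_{k=1}^{M'} \sum_{i=1, i \ne k}^{M'} \frac{1}{M'(M' - 1)} \sum_{y^n \in \Gamma'_k} \hat\nu(y^n | w'_k) r_i(y^n)$$

$$\le \frac{1}{2M} \sum_{j=1}^{2M} \sum_{m=1, m \ne j}^{2M} \sum_{k=1}^{M'} \frac{1}{M'(M' - 1)} \sum_{y^n \in \Gamma'_k} \hat\nu(y^n | w'_k) \sum_{i=1}^{M'} r_i(y^n)$$

$$\le \frac{2M}{M' - 1} e^{n(h_2(2\delta) + 2\delta \log(\|B\| - 1))} \le 4 e^{-n(R' - R - h_2(2\delta) - 2\delta \log(\|B\| - 1))} \equiv \lambda_n,$$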

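where we have assumed that $M' \ge 2$ so that $M' - 1 \ge M'/2$. Analogous to
a random coding argument, since the above expectation is less than $\lambda_n$, there
must exist a fixed collection of subscripts $j_1, \cdots, j_{2M}$ such that

$$\frac{1}{2M} \sum_{l=1}^{2M} \sum_{m=1, m \ne l}^{2M} \hat\nu(\Gamma'_{j_l} \cap (\Gamma'_{j_m})_{2\delta} \,|\, w'_{j_l}) \le \lambda_n.$$

Since no more than half of the above indices can exceed twice the expected
value, there must exist indices $k_1, \cdots, k_M \in \{j_1, \cdots, j_{2M}\}$ for which

$$\sum_{m=1, m \ne i}^{M} \hat\nu(\Gamma'_{k_i} \cap (\Gamma'_{k_m})_{2\delta} \,|\, w'_{k_i}) \le 2\lambda_n;\quad i = 1, 2, \cdots, M.$$

Define the code $\{w_i, \Gamma_i;\ i = 1, \cdots, M\}$ by $w_i = w'_{k_i}$ and

$$\Gamma_i = \Gamma'_{k_i} - \bigcup_{m=1, m \ne i}^{M} (\Gamma'_{k_m})_{2\delta}.$$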



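The $(\Gamma_i)_\delta$ are obviously disjoint since we have removed from $\Gamma'_{k_i}$ all words within
$2\delta$ of a word in any other decoding set. Furthermore, we have for all $i =
1, 2, \cdots, M$ that

$$1 - \epsilon \le \hat\nu(\Gamma'_{k_i} | w'_{k_i})$$

$$= \hat\nu\Big(\Gamma'_{k_i} \cap \Big(\bigcup_{m \ne i} (\Gamma'_{k_m})_{2\delta}\Big) \,\Big|\, w'_{k_i}\Big) + \hat\nu\Big(\Gamma'_{k_i} \cap \Big(\bigcup_{m \ne i} (\Gamma'_{k_m})_{2\delta}\Big)^c \,\Big|\, w'_{k_i}\Big)$$

$$\le \sum_{m \ne i} \hat\nu(\Gamma'_{k_i} \cap (\Gamma'_{k_m})_{2\delta} \,|\, w'_{k_i}) + \hat\nu(\Gamma_i | w_i)$$

$$< 2\lambda_n + \hat\nu(\Gamma_i | w_i)$$

and hence

$$\hat\nu(\Gamma_i | w_i) \ge 1 - \epsilon - 8 e^{-n(R' - R - h_2(2\delta) - 2\delta \log(\|B\| - 1))},$$

which proves the lemma. $\Box$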
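The counting step of the proof bounds the number of $n$-tuples within Hamming distance $2\delta n$ of a fixed word by $e^{n(h_2(2\delta) + 2\delta \ln(\|B\| - 1))}$. A minimal numerical sketch of that comparison follows; the function names and the parameter values are illustrative assumptions, and natural logarithms are used throughout.

```python
from math import comb, exp, log

def ball_size(n, radius, q):
    """Number of q-ary n-tuples within Hamming distance `radius` of a fixed word."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(radius + 1))

def h2(a):
    """Binary entropy function in nats."""
    return 0.0 if a in (0.0, 1.0) else -a * log(a) - (1 - a) * log(1 - a)

def ball_bound(n, delta, q):
    """The exponential bound exp(n (h2(2 delta) + 2 delta ln(q - 1)))."""
    return exp(n * (h2(2 * delta) + 2 * delta * log(q - 1)))

# Hypothetical parameters: n = 40, alphabet size 4, radius 2*delta*n = 8.
n, q, radius = 40, 4, 8
delta = radius / (2 * n)
print(ball_size(n, radius, q) <= ball_bound(n, delta, q))  # True
```

The bound is loose by a polynomial factor, which is harmless in the lemma because only the exponential rate matters.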

Corollary 12.5.1: Let $\nu$ be a stationary channel and let $\mathcal{C}_n$ be a sequence
of $(\mu_n, \lfloor e^{nR'} \rfloor, n, \epsilon/2)$ Feinstein codes for $n \ge n_0$. Given an $R > 0$ and $\delta > 0$
such that $R < R' - h_2(2\delta) - 2\delta \log(\|B\| - 1)$, there exists for $n_1$ sufficiently large
a sequence $\mathcal{C}'_n$; $n \ge n_1$, of $\delta$-robust $(\mu_n, \lfloor e^{nR} \rfloor, n, \epsilon)$ Feinstein codes.

Proof: The corollary follows from the lemma by choosing $n_1$ so that

$$e^{-n_1(R' - R - h_2(2\delta) - 2\delta \ln(\|B\| - 1) - 3/n_1)} \le \frac{\epsilon}{2}. \quad \Box$$

Note that the sources may be different for each $n$ and that $n_1$ does not depend
on the channel input measure.
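The choice of $n_1$ in the proof can be made explicit: since the exponent equals $-n_1 \cdot \mathrm{gap} + 3$ with $\mathrm{gap} = R' - R - h_2(2\delta) - 2\delta \ln(\|B\| - 1)$, the condition holds as soon as $n_1 \ge (3 + \ln(2/\epsilon))/\mathrm{gap}$. The sketch below is only illustrative; the function names and numerical values are assumptions, rates are in nats, and the additional requirement $n_1 \ge n_0$ is not modeled.

```python
from math import ceil, exp, log

def h2(a):
    """Binary entropy function in nats."""
    return 0.0 if a in (0.0, 1.0) else -a * log(a) - (1 - a) * log(1 - a)

def rate_gap(r_prime, r, delta, q):
    """R' - R - h2(2 delta) - 2 delta ln(q - 1) for output alphabet size q."""
    return r_prime - r - h2(2 * delta) - 2 * delta * log(q - 1)

def smallest_n1(r_prime, r, delta, q, eps):
    """Smallest n with exp(-n * gap + 3) <= eps / 2; requires a positive gap."""
    gap = rate_gap(r_prime, r, delta, q)
    if gap <= 0:
        raise ValueError("R is not below R' - h2(2 delta) - 2 delta ln(q - 1)")
    return ceil((3 + log(2 / eps)) / gap)

# Hypothetical numbers: binary output alphabet, R' = 0.5, R = 0.1 nats.
n1 = smallest_n1(0.5, 0.1, 0.02, 2, 0.01)
print(n1)
```

Because the left side of the condition decreases in $n$, the inequality then holds for every $n \ge n_1$.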



12.6 Block Coding Theorems for Noisy Channels

Suppose now that $\nu$ is a stationary finite alphabet $\bar{d}$-continuous channel. Suppose
also that for $n \ge n_1$ we have a sequence of $\delta$-robust $(\mu_n, \lfloor e^{nR} \rfloor, n, \epsilon)$
Feinstein codes $\{w_i, \Gamma_i\}$ as in the previous section. We now quantify the
performance of these codes when used as channel block codes, that is, used on the
actual channel $\nu$ instead of on an induced channel. As previously let $\hat\nu^n$ be the
$n$-dimensional channel induced by $\mu_n$ and the channel $\nu$, that is, for $\mu_n^n(a^n) > 0$

$$\hat\nu^n(G | a^n) = \Pr(Y^n \in G \,|\, X^n = a^n) = \frac{1}{\mu_n^n(a^n)} \int_{c(a^n)} \nu_x^n(G)\, d\mu_n(x), \tag{12.23}$$

where $c(a^n)$ is the rectangle $\{x : x \in A^T;\ x^n = a^n\}$, $a^n \in A^n$, and where
$G \in \mathcal{B}_B^n$. We have for the Feinstein codes that

$$\max_i \hat\nu^n(\Gamma_i^c | w_i) \le \epsilon.$$

We use the same codewords $w_i$ for the channel code, but we now use the expanded
regions $(\Gamma_i)_\delta$ for the decoding regions. Since the Feinstein codes were
$\delta$-robust, these sets are disjoint and the code well defined. Since the channel is
$\bar{d}$-continuous we can choose an $n$ large enough to ensure that if $x^n = \bar{x}^n$, then

$$\bar{d}_n(\nu_x^n, \nu_{\bar{x}}^n) \le \delta^2.$$

Suppose that we have a Feinstein code such that for the induced channel

$$\hat\nu^n(\Gamma_i | w_i) \ge 1 - \epsilon.$$

Then if the conditions of Lemma 10.5.1 are met and $\mu_n$ is the channel input
source of the Feinstein code, then

$$\hat\nu^n(\Gamma_i | w_i) = \frac{1}{\mu_n^n(w_i)} \int_{c(w_i)} \nu_x^n(\Gamma_i)\, d\mu_n(x)$$

$$\le \sup_{x \in c(w_i)} \nu_x^n(\Gamma_i) \le \inf_{x \in c(w_i)} \nu_x^n((\Gamma_i)_\delta) + \delta$$

and hence

$$\inf_{x \in c(w_i)} \nu_x^n((\Gamma_i)_\delta) \ge \hat\nu^n(\Gamma_i | w_i) - \delta \ge 1 - \epsilon - \delta.$$

Thus if the channel block code is constructed using the expanded decoding sets,
we have that

$$\max_i \sup_{x \in c(w_i)} \nu_x^n((\Gamma_i)_\delta^c) \le \epsilon + \delta;$$

that is, the code $\{w_i, (\Gamma_i)_\delta\}$ is a $(\lfloor e^{nR} \rfloor, n, \epsilon + \delta)$ channel code. We have now
proved the following result.

Lemma 12.6.1: Let $\nu$ be a stationary $\bar{d}$-continuous channel and $\mathcal{C}_n$; $n \ge n_0$,
a sequence of $\delta$-robust $(\mu_n, \lfloor e^{nR} \rfloor, n, \epsilon)$ Feinstein codes. Then for $n_1$ sufficiently
large and each $n \ge n_1$ there exists a $(\lfloor e^{nR} \rfloor, n, \epsilon + \delta)$ block channel code.

Combining the lemma with Lemma 12.4.4 and Lemma 12.4.5 yields the 
following theorem. 

Theorem 12.6.1: Let $\nu$ be an AMS ergodic $\bar{d}$-continuous channel. If $R < C$
then given $\epsilon > 0$ there is an $n_0$ such that for all $n \ge n_0$ there exist $(\lfloor e^{nR} \rfloor, n, \epsilon)$
channel codes. If the channel is not ergodic, then the same holds true if $C$ is
replaced by $C^*$.

Up to this point the channel coding theorems have been “one shot” theorems 
in that they consider only a single use of the channel. In a communication 
system, however, a channel will be used repeatedly in order to communicate a 
sequence of outputs from a source. 

12.7 Joint Source and Channel Block Codes 

We can now combine a source block code and a channel block code of comparable
rates to obtain a block code for communicating a source over a noisy
channel. Suppose that we wish to communicate a source $\{X_n\}$ with a distribution
$\mu$ over a stationary and ergodic $\bar{d}$-continuous channel $[B, \nu, \hat{B}]$. The
channel coding theorem states that if $K$ is chosen to be sufficiently large, then
we can reliably communicate length $K$ messages from a collection of $\lfloor e^{KR} \rfloor$
messages if $R < C$. Suppose that $R = C - \epsilon/2$. If we wish to send the
given source across this channel, then instead of having a source coding rate of
$(K/N) \log \|B\|$ bits or nats per source symbol for a source $(N, K)$ block code, we
reduce the source coding rate to slightly less than the channel coding rate $R$, say
$R_{\rm source} = (K/N)(R - \epsilon/2) = (K/N)(C - \epsilon)$. We then construct a block source
codebook $\mathcal{C}$ of this rate with performance near $\delta(R_{\rm source}, \mu)$. Every codeword
in the source codebook is assigned a channel codeword as index. The source is
encoded by selecting the minimum distortion word in the codebook and then
inserting the resulting channel codeword into the channel. The decoder then
uses its decoding sets to decide which channel codeword was sent and then puts
out the corresponding reproduction vector. Since the indices of the source code
words are accurately decoded by the receiver with high probability, the
reproduction vector should yield performance near that of $\delta((K/N)(C - \epsilon), \mu)$. Since
$\epsilon$ is arbitrary and $\delta(R, \mu)$ is a continuous function of $R$, this implies that the
OPTA for block coding $\mu$ for $\nu$ is given by $\delta((K/N)C, \mu)$, that is, by the OPTA
for block coding a source evaluated at the channel capacity normalized to bits
or nats per source symbol. Making this argument precise yields the block joint
source and channel coding theorem.
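As a concrete instance of evaluating the source OPTA at the normalized capacity, consider a special case that is not treated in the text and is assumed here purely for illustration: a fair binary memoryless source with Hamming distortion, for which the distortion-rate function solves $h_b(D) = 1 - R$ (rates in bits), sent over a binary symmetric channel with crossover probability $p$ and capacity $1 - h_b(p)$. The function names are hypothetical, and the sketch relies on the standard fact that the block source coding OPTA coincides with the distortion-rate function for such a source.

```python
from math import log2

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity in bits per channel use of a binary symmetric channel."""
    return 1.0 - hb(p)

def drf_fair_binary(rate):
    """D(R) for a fair binary source, Hamming distortion: solve hb(D) = 1 - R."""
    if rate >= 1.0:
        return 0.0
    if rate <= 0.0:
        return 0.5
    lo, hi = 0.0, 0.5
    for _ in range(60):          # bisection; hb is increasing on [0, 1/2]
        mid = (lo + hi) / 2
        if hb(mid) < 1.0 - rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def block_opta(p, K, N):
    """Distortion-rate function evaluated at (K/N) times the channel capacity."""
    return drf_fair_binary((K / N) * bsc_capacity(p))

print(block_opta(0.1, 1, 1))   # approximately 0.1
print(block_opta(0.1, 2, 1))   # 0.0: two channel uses per source symbol suffice
```

With $K = N$ the value equals the crossover probability itself in this toy case, and increasing $K/N$ lowers the achievable distortion.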

A joint source and channel $(N, K)$ block code consists of an encoder $\alpha :
A^N \to B^K$ and decoder $\beta : \hat{B}^K \to \hat{A}^N$. It is assumed that $N$ source time units
correspond to $K$ channel time units. The block code yields sequence coders
$\bar\alpha : A^T \to B^T$ and $\bar\beta : \hat{B}^T \to \hat{A}^T$ defined by

$$\bar\alpha(x) = \{\alpha(x_{iN}^N);\ \text{all } i\}$$

$$\bar\beta(x) = \{\beta(x_{iK}^K);\ \text{all } i\}.$$

Let $\mathcal{E}$ denote the class of all such codes (all $N$ and $K$ consistent with the
physical stationarity requirement). Let $\Delta^*(\mu, \nu, \mathcal{E})$ denote the block coding OPTA
function and $D(R, \mu)$ the distortion-rate function of the source with respect to
an additive fidelity criterion $\{\rho_n\}$. We assume also that $\rho_n$ is bounded, that is,
there is a finite value $\rho_{\max}$ such that

$$\frac{1}{n} \rho_n(x^n, \hat{x}^n) \le \rho_{\max}$$

for all $n$. This assumption is an unfortunate restriction, but it yields a simple
proof of the basic result.

Theorem 12.7.1: Let $\{X_n\}$ be a stationary source with distribution $\mu$ and
let $\nu$ be a stationary and ergodic $\bar{d}$-continuous channel with channel capacity
$C$. Let $\{\rho_n\}$ be a bounded additive fidelity criterion. Given $\epsilon > 0$ there exists
for sufficiently large $N$ and $K$ (where $K$ channel time units correspond to $N$
source time units) an encoder $\alpha : A^N \to B^K$ and decoder $\beta : \hat{B}^K \to \hat{A}^N$ such
that if $\bar\alpha : A^T \to B^T$ and $\bar\beta : \hat{B}^T \to \hat{A}^T$ are the induced sequence coders, then
the resulting performance is bounded above as

$$\Delta(\mu, \bar\alpha, \nu, \bar\beta) = E \rho_N(X^N, \hat{X}^N) \le \delta(\tfrac{K}{N} C, \mu) + \epsilon.$$

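Proof: Given $\epsilon$, choose $\gamma > 0$ so that

$$\delta(\tfrac{K}{N}(C - \gamma), \mu) \le \delta(\tfrac{K}{N} C, \mu) + \frac{\epsilon}{3}$$

and choose $N$ large enough to ensure the existence of a source codebook $\mathcal{C}$ of
length $N$ and rate $R_{\rm source} = (K/N)(C - \gamma)$ with performance

$$\rho(\mathcal{C}, \mu) \le \delta(R_{\rm source}, \mu) + \frac{\epsilon}{3}.$$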







We also assume that $N$ and hence $K$ is chosen large enough so that for a
suitably small $\delta$ (to be specified later) there exists a channel $(\lfloor e^{KR} \rfloor, K, \delta)$ code,
with $R = C - \gamma/2$. Index the words in the source codebook by the
$\lfloor e^{K(C - \gamma/2)} \rfloor$ channel codewords. By construction there are more indices than
source codewords so that this is possible. We now evaluate the performance of
this code.

Suppose that there are $M$ words in the source codebook and hence $M$ of the
channel words are used. Let $\hat{x}_i$ and $w_i$ denote corresponding source and channel
codewords, that is, if $\hat{x}_i$ is the minimum distortion word in the source codebook
for an observed vector, then $w_i$ is transmitted over the channel. Let $\Gamma_i$ denote
the corresponding decoding region. Then

$$E \rho_N(X^N, \hat{X}^N) = \sum_{i=1}^{M} \sum_{j=1}^{M} \int_{x : \alpha(x^N) = w_i} d\mu(x)\, \nu_x^K(\Gamma_j)\, \rho_N(x^N, \hat{x}_j)$$

$$= \sum_{i=1}^{M} \int_{x : \alpha(x^N) = w_i} d\mu(x)\, \nu_x^K(\Gamma_i)\, \rho_N(x^N, \hat{x}_i) + \sum_{i=1}^{M} \sum_{j=1, j \ne i}^{M} \int_{x : \alpha(x^N) = w_i} d\mu(x)\, \nu_x^K(\Gamma_j)\, \rho_N(x^N, \hat{x}_j)$$

$$\le \sum_{i=1}^{M} \int_{x : \alpha(x^N) = w_i} d\mu(x)\, \rho_N(x^N, \hat{x}_i) + \sum_{i=1}^{M} \sum_{j=1, j \ne i}^{M} \int_{x : \alpha(x^N) = w_i} d\mu(x)\, \nu_x^K(\Gamma_j)\, \rho_N(x^N, \hat{x}_j).$$

The first term is bounded above by $\delta(R_{\rm source}, \mu) + \epsilon/3$ by construction. The
second is bounded above by $\rho_{\max}$ times the channel error probability, which is
less than $\delta$ by assumption. If $\delta$ is chosen so that $\rho_{\max} \delta$ is less than $\epsilon/3$, the
three contributions of $\epsilon/3$ sum to $\epsilon$ and the theorem is proved. $\Box$

Theorem 12.7.2: Let $\{X_n\}$ be a stationary source with distribution
$\mu$ and let $\nu$ be a stationary channel with channel capacity $C$. Let $\{\rho_n\}$ be a
bounded additive fidelity criterion. For any block stationary communication
system $(\mu, f, \nu, g)$, the average performance satisfies

$$\Delta(\mu, f, \nu, g) \ge \int d\bar\mu(x)\, D(C, \bar\mu_x),$$

where $\bar\mu$ is the stationary mean of $\mu$ and $\{\bar\mu_x\}$ is the ergodic decomposition of
$\bar\mu$, $C$ is the capacity of the channel, and $D(R, \mu)$ the distortion-rate function.

Proof: Suppose that the process $\{X_n^N, U_n^K, Y_n^K, \hat{X}_n^N\}$ is stationary and
consider the overall mutual information rate $\bar{I}(X; \hat{X})$. From the data processing
theorem (Lemma 9.4.8)

$$\bar{I}(X; \hat{X}) \le \frac{K}{N} \bar{I}(U; Y) \le \frac{K}{N} C.$$

Choose $L$ sufficiently large so that

$$\frac{1}{n} I(X^n; \hat{X}^n) \le \frac{K}{N} C + \epsilon$$

and

$$D_n(\tfrac{K}{N} C + \epsilon, \mu) \ge D(\tfrac{K}{N} C + \epsilon, \mu) - \delta$$

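for $n \ge L$. Then if the ergodic component $\bar\mu_x$ is in effect, the performance can
be no better than

$$E_{\bar\mu_x} \rho_N(X^N, \hat{X}^N) \ge \inf_{p^N \in \mathcal{R}_N(\frac{K}{N} C + \epsilon,\, \bar\mu_x^N)} \rho_N(X^N, \hat{X}^N) \ge D_N(\tfrac{K}{N} C + \epsilon, \bar\mu_x)$$

which when integrated yields a lower bound of

$$\int d\bar\mu(x)\, D(\tfrac{K}{N} C + \epsilon, \bar\mu_x) - \delta.$$

Since $\delta$ and $\epsilon$ are arbitrary, the theorem follows from the continuity of the
distortion-rate function. $\Box$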

Combining the previous results yields the block coding OPTA for stationary
sources and stationary and ergodic $\bar{d}$-continuous channels.

Corollary 12.7.1: Let $\{X_n\}$ be a stationary source with distribution $\mu$ and
let $\nu$ be a stationary and ergodic $\bar{d}$-continuous channel with channel capacity
$C$. Let $\{\rho_n\}$ be a bounded additive fidelity criterion. The block coding OPTA
function is given by

$$\Delta^*(\mu, \nu, \mathcal{E}, \mathcal{D}) = \int d\bar\mu(x)\, D(C, \bar\mu_x).$$



12.8 Synchronizing Block Channel Codes 

As in the source coding case, the first step towards proving a sliding block coding 
theorem is to show that a block code can be synchronized, that is, that the de- 
coder can determine (at least with high probability) where the block code words 
begin and end. Unlike the source coding case, this cannot be accomplished by 
the use of a simple synchronization sequence which is prohibited from appearing 
within a block code word since channel errors can cause the appearance of the 
sync word at the receiver by accident. The basic idea still holds, however, if the 
codes are designed so that it is very unlikely that a non-sync word can be
converted into a valid sync word. If the channel is $\bar{d}$-continuous, then good robust
Feinstein codes as in Corollary 12.5.1 can be used to obtain good codebooks.
The basic result of this section is Lemma 12.8.1 which states that given a
sequence of good robust Feinstein codes, the code length can be chosen large
enough to ensure that there is a sync word for a slightly modified codebook;
that is, the sync word has length a specified fraction of the codeword length
and the sync decoding words never appear as a segment of codeword decoding
words. The technique is due to Dobrushin [33] and is an application of
Shannon's random coding technique. The lemma originated in [59].

The basic idea of the lemma is this: In addition to a good long code, one 
selects a short good robust Feinstein code (from which the sync word will be 
chosen) and then performs the following experiment. A word from the short 
code and a word from the long code are selected independently and at random. 
The probability that the short decoding word appears in the long decoding word 
is shown to be small. Since this average is small, there must be at least one short 
word such that the probability of its decoding word appearing in the decoding 
word of a randomly selected long code word is small. This in turn implies 
that if all long decoding words containing the short decoding word are removed 
from the long code decoding sets, the decoding sets of most of the original long 
code words will not be changed by much. In fact, one must remove a bit more 
from the long word decoding sets in order to ensure the desired properties are 
preserved when passing from a Feinstein code to a channel codebook. 

Lemma 12.8.1: Assume that $\epsilon \le 1/4$ and $\{\mathcal{C}_n;\ n \ge n_0\}$ is a sequence of
$\epsilon$-robust $(\tau, M(n), n, \epsilon/2)$ Feinstein codes for a $\bar{d}$-continuous channel $\nu$ having
capacity $C > 0$. Assume also that $h(2\epsilon) + 2\epsilon \log(\|B\| - 1) < C$, where $B$ is the
channel output alphabet. Let $\delta \in (0, 1/4)$. Then there exists an $n_1$ such that
for all $n \ge n_1$ the following statements are true.

(A) If $\mathcal{C}_n = \{v_i, \Gamma_i;\ i = 1, \cdots, M(n)\}$, then there is a modified codebook $\mathcal{W}_n =
\{w_i, W_i;\ i = 1, \cdots, K(n)\}$ and a set of $K(n)$ indices $\mathcal{K}_n = \{k_1, \cdots, k_{K(n)}\} \subset
\{1, \cdots, M(n)\}$ such that $w_i = v_{k_i}$, $W_i \subset (\Gamma_{k_i})_{\epsilon^2}$; $i = 1, \cdots, K(n)$, and

$$\max_{1 \le j \le K(n)} \sup_{x \in c(w_j)} \nu_x^n(W_j^c) \le \epsilon. \tag{12.24}$$

(B) There is a sync word $\sigma \in A^r$, $r = r(n) = \lceil \delta n \rceil$ = smallest integer larger
than $\delta n$, and a sync decoding set $S \in \mathcal{B}_B^r$ such that

$$\sup_{x \in c(\sigma)} \nu_x^r(S^c) \le \epsilon \tag{12.25}$$

and such that no $r$-tuple in $S$ appears in any $n$-tuple in $W_i$; that is, if
$G(b^r) = \{y^n : y_i^r = b^r \text{ some } i = 0, \cdots, n - r\}$ and $G(S) = \bigcup_{b^r \in S} G(b^r)$,
then

$$G(S) \cap W_i = \emptyset,\quad i = 1, \cdots, K(n). \tag{12.26}$$

(C) We have that

$$\|\{k : k \notin \mathcal{K}_n\}\| \le \epsilon \delta M(n). \tag{12.27}$$

The modified code $\mathcal{W}_n$ has fewer words than the original code $\mathcal{C}_n$, but (12.27)
ensures that $\mathcal{W}_n$ cannot be much smaller since

$$K(n) \ge (1 - \epsilon\delta) M(n). \tag{12.28}$$

Given a codebook $\mathcal{W}_n = \{w_i, W_i;\ i = 1, \cdots, K(n)\}$, a sync word $\sigma \in A^r$,
and a sync decoding set $S$, we call the length $n + r$ codebook $\{\sigma \times w_i, S \times W_i;\
i = 1, \cdots, K(n)\}$ a prefixed or punctuated codebook.
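The separation property (12.26) and the construction of a prefixed codebook are easy to state operationally. The sketch below is an illustration only: words are represented as Python tuples, the helper names are hypothetical, and the toy sets are not derived from any actual Feinstein code.

```python
def contains_segment(word, segments, r):
    """True if some length-r window of `word` belongs to `segments`."""
    return any(tuple(word[i:i + r]) in segments
               for i in range(len(word) - r + 1))

def sync_is_separated(sync_set, decoding_sets):
    """Check that no r-tuple of the sync decoding set appears inside any
    n-tuple of any codeword decoding set, as required by (12.26)."""
    r = len(next(iter(sync_set)))
    return not any(contains_segment(y, sync_set, r)
                   for w_set in decoding_sets for y in w_set)

def prefix_codebook(sync_word, codewords):
    """Form the length n + r prefixed codewords (sync word, then codeword)."""
    return [tuple(sync_word) + tuple(w) for w in codewords]

# Toy example: sync decoding set {111}, two decoding sets of 6-tuples.
S = {(1, 1, 1)}
W = [{(0, 0, 1, 0, 1, 0)}, {(0, 1, 1, 0, 0, 0)}]
print(sync_is_separated(S, W))                         # True
print(sync_is_separated(S, W + [{(0, 1, 1, 1, 0, 0)}]))  # False
print(prefix_codebook((1, 1, 1), [(0, 0, 1, 0, 1, 0)]))
```

Note that the check concerns windows inside a single decoding word; overlaps between the end of one prefixed codeword and the start of the next are a separate matter.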

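Proof: Since $\nu$ is $\bar{d}$-continuous, $n_2$ can be chosen so large that for $n > n_2$

$$\max_{a^n \in A^n} \sup_{x, x' \in c(a^n)} \bar{d}_n(\nu_x^n, \nu_{x'}^n) \le \left(\frac{\delta\epsilon}{2}\right)^2. \tag{12.29}$$

From Corollary 12.5.1 there is an $n_3$ so large that for each $r \ge n_3$ there exists
an $\epsilon/2$-robust $(\tau, J, r, \epsilon/2)$-Feinstein code $\mathcal{C}_s = \{s_j, S_j : j = 1, \cdots, J\}$; $J \ge 2^{r R_s}$,
where $R_s \in (0, C - h(2\epsilon) - 2\epsilon \log(\|B\| - 1))$. Assume that $n_1$ is large enough
to ensure that $\delta n_1 \ge n_2$; $\delta n_1 \ge n_3$, and $n_1 \ge n_0$. Let $1_F$ denote the indicator
function of the set $F$ and define $\lambda_n$ by

$$\lambda_n = J^{-1} \sum_{j=1}^{J} \frac{1}{M(n)} \sum_{i=1}^{M(n)} \hat\nu^n(G((S_j)_\epsilon) \cap \Gamma_i \,|\, v_i)$$

$$\le J^{-1} \sum_{j=1}^{J} \frac{1}{M(n)} \sum_{i=1}^{M(n)} \sum_{b^r \in (S_j)_\epsilon} \sum_{y^n \in \Gamma_i} \hat\nu^n(y^n | v_i)\, 1_{G(b^r)}(y^n)$$

$$= J^{-1} \frac{1}{M(n)} \sum_{i=1}^{M(n)} \sum_{y^n \in \Gamma_i} \hat\nu^n(y^n | v_i) \left[ \sum_{j=1}^{J} \sum_{b^r \in (S_j)_\epsilon} 1_{G(b^r)}(y^n) \right]. \tag{12.30}$$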



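Since the $(S_j)_\epsilon$ are disjoint and a fixed $y^n$ can belong to at most $n - r + 1 \le n$ sets
$G(b^r)$, the bracket term above is bounded above by $n$ and hence

$$\lambda_n \le \frac{n}{J} \frac{1}{M(n)} \sum_{i=1}^{M(n)} \hat\nu^n(\Gamma_i | v_i) \le \frac{n}{J} \le n 2^{-r R_s} \le n 2^{-\delta n R_s}$$

so that choosing $n_1$ also so that $n 2^{-\delta n R_s} \le (\delta\epsilon)^2$ for $n \ge n_1$ we have that
$\lambda_n \le (\delta\epsilon)^2$ if $n \ge n_1$. From (12.30) this implies that for $n \ge n_1$ there must exist
at least one $j$ for which

$$\frac{1}{M(n)} \sum_{i=1}^{M(n)} \hat\nu^n(G((S_j)_\epsilon) \cap \Gamma_i \,|\, v_i) \le (\delta\epsilon)^2$$

which in turn implies that for $n \ge n_1$ there must exist a set of indices $\mathcal{K}_n \subset
\{1, \cdots, M(n)\}$ such that

$$\hat\nu^n(G((S_j)_\epsilon) \cap \Gamma_i \,|\, v_i) \le \delta\epsilon,\quad i \in \mathcal{K}_n,$$

$$\|\{i : i \notin \mathcal{K}_n\}\| \le \delta\epsilon M(n).$$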

Define $\sigma = s_j$; $S = (S_j)_{\epsilon/2}$, $w_i = v_{k_i}$, and $W_i = (\Gamma_{k_i} \cap G((S_j)_\epsilon)^c)_{\epsilon\delta}$; $i =
1, \cdots, K(n)$. We then have from Lemma 12.6.1 and (12.29) that if $x \in c(\sigma)$,
then since $\epsilon\delta \le \epsilon/2$

$$\nu_x^r(S) = \nu_x^r((S_j)_{\epsilon/2}) \ge \hat\nu^r(S_j | \sigma) - \frac{\epsilon}{2} \ge 1 - \epsilon,$$

proving (12.25). Next observe that if $y^n \in (G((S_j)_\epsilon)^c)_{\epsilon\delta}$, then there is a $b^n \in
G((S_j)_\epsilon)^c$ such that $d_n(y^n, b^n) \le \epsilon\delta$ and thus for $i = 0, 1, \cdots, n - r$ we have that

$$d_r(y_i^r, b_i^r) \le \frac{\epsilon}{2}.$$

Since $b^n \in G((S_j)_\epsilon)^c$, it has no $r$-tuple within $\epsilon$ of an $r$-tuple in $S_j$ and hence
the $r$-tuples $y_i^r$ are at least $\epsilon/2$ distant from $S_j$ and hence $y^n \in G((S_j)_{\epsilon/2})^c$. We
have therefore that $(G((S_j)_\epsilon)^c)_{\epsilon\delta} \subset G((S_j)_{\epsilon/2})^c$ and hence

$$G(S) \cap W_i = G((S_j)_{\epsilon/2}) \cap (\Gamma_{k_i} \cap G((S_j)_\epsilon)^c)_{\delta\epsilon}$$

$$\subset G((S_j)_{\epsilon/2}) \cap (G((S_j)_\epsilon)^c)_{\delta\epsilon} = \emptyset,$$

completing the proof. $\Box$

Combining the preceding lemma with the existence of robust Feinstein codes 
at rates less than capacity (Lemma 12.6.1) we have proved the following syn- 
chronized block coding theorem. 

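Corollary 12.8.1: Let $\nu$ be a stationary ergodic $\bar{d}$-continuous channel and
fix $\epsilon > 0$ and $R \in (0, C)$. Then there exists for sufficiently large blocklength
$N$, a length $N$ codebook $\{\sigma \times w_i, S \times W_i;\ i = 1, \cdots, M\}$, $M \ge 2^{NR}$, $\sigma \in A^r$,
$w_i \in A^n$, $r + n = N$, such that

$$\sup_{x \in c(\sigma)} \nu_x^r(S^c) \le \epsilon,$$

$$\max_{1 \le j \le M} \sup_{x \in c(w_j)} \nu_x^n(W_j^c) \le \epsilon,$$

$$W_j \cap G(S) = \emptyset.$$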

Proof: Choose $\delta \in (0, \epsilon/2)$ so small that $C - h(2\delta) - 2\delta \log(\|B\| - 1) > (1 +
\delta) R (1 - \log(1 - \delta^2))$ and choose $R' \in ((1 + \delta) R (1 - \log(1 - \delta^2)),\ C - h(2\delta) -
2\delta \log(\|B\| - 1))$. From Lemma 12.6.1 there exists an $n_0$ such that for $n \ge n_0$
there exist $\delta$-robust $(\tau, M(n), n, \delta)$ Feinstein codes with $M(n) \ge 2^{nR'}$. From Lemma
12.8.1 there exists a codebook $\{w_i, W_i;\ i = 1, \cdots, K(n)\}$, a sync word $\sigma \in A^r$,
and a sync decoding set $S \in \mathcal{B}_B^r$, $r = \lceil \delta n \rceil$ such that

$$\max_j \sup_{x \in c(w_j)} \nu_x^n(W_j^c) \le 2\delta \le \epsilon,$$

$$\sup_{x \in c(\sigma)} \nu_x^r(S^c) \le 2\delta \le \epsilon,$$

$G(S) \cap W_j = \emptyset$; $j = 1, \cdots, K(n)$, and from (12.28)

$$M = K(n) \ge (1 - \delta^2) M(n).$$

Therefore for $N = n + r$

$$N^{-1} \log M \ge (n + \lceil n\delta \rceil)^{-1} \log((1 - \delta^2) 2^{nR'})$$

$$= \frac{nR' + \log(1 - \delta^2)}{n + n\delta} = \frac{R' + n^{-1} \log(1 - \delta^2)}{1 + \delta}$$

$$\ge \frac{R' + \log(1 - \delta^2)}{1 + \delta} \ge R,$$

completing the proof. $\Box$



12.9 Sliding Block Source and Channel Coding 

Analogous to the conversion of block source codes into sliding block source 
codes, the basic idea of constructing a sliding block channel code is to use a 
punctuation sequence to stationarize a block code and to use sync words to 
locate the blocks in the decoded sequence. The sync word can be used to 
mark the beginning of a codeword and it will rarely be falsely detected during 
a codeword. Unfortunately, however, an r-tuple consisting of a segment of 
a sync and a segment of a codeword may be erroneously detected as a sync 
with nonnegligible probability. To resolve this confusion we look at the relative 
frequency of sync-detects over a sequence of blocks instead of simply trying to 
find a single sync. The idea is that if we look at enough blocks, the relative 
frequency of the sync-detects in each position should be nearly the probability 
of occurrence in that position and these quantities taken together give a pattern 
that can be used to determine the true sync location. For the ergodic theorem 
to apply, however, we require that blocks be ergodic and hence we first consider 
totally ergodic sources and channels and then generalize where possible. 

Totally Ergodic Sources 

Lemma 12.9.1: Let $\nu$ be a totally ergodic stationary $\bar{d}$-continuous channel.
Fix $\epsilon, \delta > 0$ and assume that $\mathcal{C}_N = \{\sigma \times w_i;\ S \times W_i : i = 1, \cdots, K\}$ is a prefixed
codebook satisfying (12.24)-(12.26). Let $\gamma_N : G^N \to \mathcal{C}_N$ assign an $N$-tuple
in the prefixed codebook to each $N$-tuple in $G^N$ and let $[G, \mu, U]$ be an
$N$-stationary, $N$-ergodic source. Let $c(a^n)$ denote the cylinder set or rectangle
of all sequences $u = (\cdots, u_{-1}, u_0, u_1, \cdots)$ for which $u^n = a^n$. There exists
for sufficiently large $L$ (which depends on the source) a sync locating function
$s : B^{LN} \to \{0, 1, \cdots, N - 1\}$ and a set $\Phi \in \mathcal{B}_G^m$, $m = (L + 1)N$, such that if
$u^m \in \Phi$ and $\gamma_N(u_{LN}^N) = \sigma \times w_i$, then

$$\inf_{x \in c(\gamma_m(u^m))} \nu_x(y : s(y_\theta^{LN}) = \theta,\ \theta = 0, \cdots, N - 1;\ y_{LN}^N \in S \times W_i) \ge 1 - 3\epsilon. \tag{12.31}$$

Comments: The lemma can be interpreted as follows. The source is block
encoded using $\gamma_N$. The decoder observes a possible sync word and then looks
"back" in time at previous channel outputs and calculates $s(y^{LN})$ to obtain the
exact sync location, which is correct with high probability. The sync locator
function is constructed roughly as follows: Since $\mu$ and $\nu$ are $N$-stationary and
$N$-ergodic, if $\gamma : A^\infty \to B^\infty$ is the sequence encoder induced by the length
$N$ block code $\gamma_N$, then the encoded source $\mu\gamma^{-1}$ and the induced channel
output process $\eta$ are all $N$-stationary and $N$-ergodic. The sequence $z_j =
\eta(T^j c(S))$; $j = \cdots, -1, 0, 1, \cdots$ is therefore periodic with period $N$.
Furthermore, $z_j$ can have no smaller period than $N$ since from (12.24)-(12.26)
$\eta(T^j c(S)) \le \epsilon$, $j = r + 1, \cdots, n - r$ and $\eta(c(S)) \ge 1 - \epsilon$. Thus defining the
sync pattern $\{z_j;\ j = 0, 1, \cdots, N - 1\}$, the pattern is distinct from any cyclic
shift of itself of the form $\{z_k, \cdots, z_{N-1}, z_0, \cdots, z_{k-1}\}$, where $k \le N - 1$. The
sync locator computes the relative frequencies of the occurrence of $S$ at
intervals of length $N$ for each of $N$ possible starting points to obtain, say, a
vector $\hat{z}^N = (\hat{z}_0, \hat{z}_1, \cdots, \hat{z}_{N-1})$. The ergodic theorem implies that the $\hat{z}_i$ will
be near their expectation and hence with high probability $(\hat{z}_0, \cdots, \hat{z}_{N-1}) =
(z_\theta, z_{\theta+1}, \cdots, z_{N-1}, z_0, \cdots, z_{\theta-1})$, determining $\theta$. Another way of looking at the
result is to observe that the sources $\eta T^j$; $j = 0, \cdots, N - 1$ are each $N$-ergodic
and $N$-stationary and hence any two are either identical or orthogonal in the
sense that they place all of their measure on disjoint $N$-invariant sets. (See,
e.g., Exercise 1, Chapter 6 of [50].) No two can be identical, however, since
if $\eta T^i = \eta T^j$ for $i \ne j$; $0 \le i, j \le N - 1$, then $\eta$ would be periodic with
period $|i - j|$ strictly less than $N$, yielding a contradiction. Since membership
in any set can be determined with high probability by observing the sequence
for a long enough time, the sync locator attempts to determine which of the
$N$ distinct sources $\eta T^j$ is being observed. In fact, synchronizing the output
is exactly equivalent to forcing the $N$ sources $\eta T^j$; $j = 0, 1, \cdots, N - 1$ to be
distinct $N$-ergodic sources. After this is accomplished, the remainder of the
proof is devoted to using the properties of $\bar{d}$-continuous channels to show that
synchronization of the output source when driven by $\mu$ implies that with high
probability the channel output can be synchronized for all fixed input sequences
in a set of high $\mu$ probability.
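The matching step described in these comments can be sketched as follows. This is an idealized illustration rather than the sync locating function of the lemma: it assumes the decoder already has a 0/1 sequence `detects` recording whether the sync decoding set was observed at each time, and that the sync pattern is known; both names are hypothetical.

```python
def phase_frequencies(detects, N):
    """Relative frequency of sync detections at each of the N phases."""
    L = len(detects) // N
    return [sum(detects[j + i * N] for i in range(L)) / L for j in range(N)]

def locate_sync(detects, pattern):
    """Return the shift theta whose cyclic shift of `pattern`
    (z_theta, z_theta+1, ..., z_theta-1) best matches the observed frequencies."""
    N = len(pattern)
    freq = phase_frequencies(detects, N)

    def mismatch(theta):
        return max(abs(freq[j] - pattern[(theta + j) % N]) for j in range(N))

    return min(range(N), key=mismatch)

# Idealized toy pattern with period 5 that differs from all its cyclic shifts.
pattern = [1.0, 0.0, 0.0, 1.0, 0.0]
theta = 3
detects = [int(pattern[(t + theta) % 5]) for t in range(100)]
print(locate_sync(detects, pattern))  # 3
```

The requirement that the pattern differ from each of its nontrivial cyclic shifts is what makes the minimizing shift unique; in the lemma the margin between shifts is controlled by (12.32).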

The lemma is stronger (and more general) than the similar results of Nedoma 
[107] and Vajda [141], but the extra structure is required for application to 
sliding block decoding. 

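Proof: Choose $\zeta > 0$ so that $\zeta < \epsilon/2$ and

$$\zeta < \frac{1}{8} \min_{i, j : z_i \ne z_j} |z_i - z_j|. \tag{12.32}$$

For $\alpha > 0$ and $\theta = 0, 1, \cdots, N - 1$ define the sets $\psi(\theta, \alpha) \in \mathcal{B}_B^{LN}$ and $\tilde\psi(\theta, \alpha) \in
\mathcal{B}_B^m$, $m = (L + 1)N$ by

$$\psi(\theta, \alpha) = \Big\{y^{LN} : \Big|\frac{1}{L - 1} \sum_{i=0}^{L-2} 1_S(y_{j+iN}^r) - z_{\theta+j}\Big| \le \alpha;\ j = 0, 1, \cdots, N - 1\Big\}$$

$$\tilde\psi(\theta, \alpha) = B^\theta \times \psi(\theta, \alpha) \times B^{N-\theta}.$$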

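From the ergodic theorem $L$ can be chosen large enough so that

$$\eta\Big(\bigcap_{\theta=0}^{N-1} T^{-\theta} c(\psi(\theta, \zeta))\Big) = \eta^m\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta, \zeta)\Big) \ge 1 - \zeta^2. \tag{12.33}$$

Assume also that $L$ is large enough so that if $x_i = x'_i$, $i = 0, \cdots, m - 1$ then

$$\bar{d}_m(\nu_x^m, \nu_{x'}^m) \le \left(\frac{\zeta}{N}\right)^2. \tag{12.34}$$

From (12.33)

$$\zeta^2 \ge \eta^m\Big(\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta, \zeta)\Big)^c\Big) = \sum_{a^m \in G^m} \int_{c(a^m)} d\mu(u)\, \nu_{\gamma(u)}^m\Big(\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta, \zeta)\Big)^c\Big)$$

$$= \sum_{a^m \in G^m} \mu^m(a^m)\, \hat\nu^m\Big(\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta, \zeta)\Big)^c \,\Big|\, \gamma_m(a^m)\Big)$$

and hence there must be a set $\Phi \in \mathcal{B}_G^m$ such that

$$\hat\nu^m\Big(\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta, \zeta)\Big)^c \,\Big|\, \gamma_m(a^m)\Big) \le \zeta,\quad a^m \in \Phi, \tag{12.35}$$

$$\mu^m(\Phi^c) \le \zeta. \tag{12.36}$$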

Define the sync locating function $s : B^{LN} \to \{0, 1, \cdots, N - 1\}$ as follows:
Define the set $\psi(\theta) = \{y^{LN} \in (\psi(\theta, \zeta))_{2\zeta/N}\}$ and then define

$$s(y^{LN}) = \begin{cases} \theta & y^{LN} \in \psi(\theta) \\ 1 & \text{otherwise.} \end{cases}$$

We show that $s$ is well defined by showing that $\psi(\theta) \subset \psi(\theta, 4\zeta)$, which sets are
disjoint for $\theta = 0, 1, \cdots, N - 1$ from (12.32). If $y^{LN} \in \psi(\theta)$, there is a $b^{LN} \in
\psi(\theta, \zeta)$ for which $d_{LN}(y^{LN}, b^{LN}) \le 2\zeta/N$ and hence for any $j \in \{0, 1, \cdots, N - 1\}$
at most $LN(2\zeta/N) = 2\zeta L$ of the consecutive nonoverlapping $N$-tuples $y_{j+iN}^N$,
$i = 0, 1, \cdots, L - 2$, can differ from the corresponding $b_{j+iN}^N$ and therefore

$$\Big|\frac{1}{L - 1} \sum_{i=0}^{L-2} 1_S(y_{j+iN}^r) - z_{\theta+j}\Big| \le \Big|\frac{1}{L - 1} \sum_{i=0}^{L-2} 1_S(b_{j+iN}^r) - z_{\theta+j}\Big| + 2\zeta \le 3\zeta$$

and hence $y^{LN} \in \psi(\theta, 4\zeta)$. If $\tilde\psi(\theta)$ is defined to be $B^\theta \times \psi(\theta) \times B^{N-\theta} \in \mathcal{B}_B^m$,
then we also have that



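$$\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta, \zeta)\Big)_{\zeta/N} \subset \bigcap_{\theta=0}^{N-1} \tilde\psi(\theta)$$

since if $y^m \in (\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta, \zeta))_{\zeta/N}$, then there is a $b^m$ such that $b_\theta^{LN} \in \psi(\theta, \zeta)$;
$\theta = 0, 1, \cdots, N - 1$ and $d_m(y^m, b^m) \le \zeta/N$ for $\theta = 0, 1, \cdots, N - 1$. This implies
from Lemma 12.6.1 and (12.34)-(12.36) that if $x \in c(\gamma_m(a^m))$ and $a^m \in \Phi$, then

$$\nu_x^m\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta)\Big) \ge \nu_x^m\Big(\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta, \zeta)\Big)_{\zeta/N}\Big)$$

$$\ge \hat\nu^m\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta, \zeta) \,\Big|\, \gamma_m(a^m)\Big) - \frac{\zeta}{N}. \tag{12.37}$$

To complete the proof, we use (12.24)-(12.26) and (12.37) to obtain for
$a^m \in \Phi$ and $\gamma_N(a_{LN}^N) = \sigma \times w_i$ that

$$\nu_x(y : s(y_\theta^{LN}) = \theta,\ \theta = 0, 1, \cdots, N - 1;\ y_{LN}^N \in S \times W_i)$$

$$\ge \nu_x^m\Big(\bigcap_{\theta=0}^{N-1} \tilde\psi(\theta)\Big) - \nu_{T^{-NL}x}^N((S \times W_i)^c) \ge 1 - \epsilon - 2\zeta. \quad \Box$$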

0=0 



Next the prefixed block code and the sync locator function are combined 
with a random punctuation sequence of Lemma 9.5.2 to construct a good sliding 
block code for a totally ergodic source with entropy less than capacity. 

Lemma 12.9.2: Given a $\bar{d}$-continuous totally ergodic stationary channel
$\nu$ with Shannon capacity $C$, a stationary totally ergodic source $[G, \mu, U]$ with
entropy rate $\bar{H}(\mu) < C$, and $\delta > 0$, there exists for sufficiently large $n, m$
a sliding block encoder $f : G^n \to A$ and decoder $g : B^m \to G$ such that
$P_e(\mu, \nu, f, g) \le \delta$.

Proof: Choose $R$, $\bar{H} < R < C$, and fix $\epsilon > 0$ so that $\epsilon \le \delta/5$ and $\epsilon \le
(R - \bar{H})/2$. Choose $N$ large enough so that the conditions and conclusions of
Corollary 12.8.1 hold. Construct first a joint source and channel block encoder
$\gamma_N$ as follows: From the asymptotic equipartition property (Lemma 3.2.1 or
Section 3.5), there is an $n_0$ large enough to ensure that for $N \ge n_0$ the set

$$G_N = \{u^N : |N^{-1} h_N(u) - \bar{H}| \le \epsilon\}$$

$$= \{u^N : e^{-N(\bar{H} + \epsilon)} \le \mu(u^N) \le e^{-N(\bar{H} - \epsilon)}\} \tag{12.38}$$

has probability

$$\mu_{U^N}(G_N) \ge 1 - \epsilon. \tag{12.39}$$

Observe that if $M' = \|G_N\|$, then

$$2^{N(\bar{H} - \epsilon)} \le M' \le 2^{N(\bar{H} + \epsilon)} \le 2^{N(R - \epsilon)}. \tag{12.40}$$

Index the members of $G_N$ as $\beta_i$; $i = 1, \cdots, M'$. If $u^N = \beta_i$, set $\gamma_N(u^N)
= \sigma \times w_i$. Otherwise set $\gamma_N(u^N) = \sigma \times w_{M'+1}$. Since for large $N$, $2^{N(R - \epsilon)} + 1 \le
2^{NR}$, $\gamma_N$ is well defined. $\gamma_N$ can be viewed as a synchronized extension of the
almost noiseless code of Section 3.5. Define also the block decoder $\psi_N(y^N) = \beta_i$
if $y^N \in S \times W_i$; $i = 1, \cdots, M'$. Otherwise set $\psi_N(y^N) = \beta^*$, an arbitrary
reference vector. Choose $L$ so large that the conditions and conclusions of
Lemma 12.9.1 hold for $\mathcal{C}_N$ and $\gamma_N$. The sliding block decoder $g_m : B^m \to G$,
$m = (L + 1)N$, yielding decoded process $\hat{U}_k = g_m(Y_{k-NL}^m)$ is defined as follows:
If $s(y_{k-NL}, \cdots, y_{k-1}) = \theta$, form $b^N = \psi_N(y_{k-\theta}, \cdots, y_{k-\theta-N})$ and set $\hat{U}_k(y) =
g_m(y_{k-NL}, \cdots, y_{k+N}) = b_\theta$, the appropriate symbol of the appropriate block.
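The set $G_N$ of (12.38) can be enumerated directly for a toy source. The sketch below assumes, purely for illustration, a memoryless binary source with $\Pr(1) = p$ (an assumption not made in the lemma) and works in nats; it checks the counting bounds $\mu(G_N) e^{N(\bar{H} - \epsilon)} \le \|G_N\| \le e^{N(\bar{H} + \epsilon)}$ that underlie (12.40). The function name is hypothetical.

```python
from itertools import product
from math import exp, log

def typical_set(p, N, eps):
    """Enumerate N-tuples whose sample entropy is within eps of the entropy
    of an i.i.d. binary source with P(1) = p (entropies in nats)."""
    H = -p * log(p) - (1 - p) * log(1 - p)
    members, prob = [], 0.0
    for u in product((0, 1), repeat=N):
        k = sum(u)
        pu = p ** k * (1 - p) ** (N - k)
        if abs(-log(pu) / N - H) <= eps:
            members.append(u)
            prob += pu
    return members, prob, H

members, prob, H = typical_set(0.3, 12, 0.1)
M_prime = len(members)
print(M_prime, round(prob, 3))
print(prob * exp(12 * (H - 0.1)) <= M_prime <= exp(12 * (H + 0.1)))  # True
```

At such a short block length the probability of $G_N$ is still well below one; the asymptotic equipartition property only guarantees (12.39) for $N$ large.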

The sliding block encoder $f$ will send very long sequences of block words
with random spacing to make the code stationary. Let $K$ be a large number
satisfying $K\epsilon \ge L + 1$ so that $m \le \epsilon K N$ and recall that $N \ge 3$ and $L \ge 1$. We
then have that

$$\frac{1}{KN} \le \frac{1}{3K} \le \frac{\epsilon}{6}. \tag{12.41}$$



Use Corollary 9.4.2 to produce a $(KN, \epsilon)$ punctuation sequence $Z_n$ using a
finite length sliding block code of the input sequence. The punctuation process
is stationary and ergodic, has a ternary output and can produce only isolated
0's followed by $KN$ 1's or individual 2's. The punctuation sequence is then
used to convert the block encoder $\gamma_N$ into a sliding block coder: Suppose that
the encoder views an input sequence $u = \cdots, u_{-1}, u_0, u_1, \cdots$ and is to produce a
single encoded symbol $x_0$. If the punctuation symbol $Z_0$ is a 2, then the encoder
produces an arbitrary channel symbol, say $a^*$. If $Z_0$ is not a 2, then the encoder
inspects $Z_0, Z_{-1}, Z_{-2}$ and so on into the past until it locates the first 0. This
must happen within $KN$ input symbols by construction of the punctuation sequence.
Given that the first 1 occurs at, say, $Z_l = 1$, the encoder then uses the block
code $\gamma_N$ to encode successive blocks of input $N$-tuples until the block including
the symbol at time 0 is encoded. The sliding block encoder then produces the
corresponding channel symbol $x_0$. Thus if $Z_l = 1$, then for some $J < K$,
$x_0 = (\gamma_N(u_{l+JN}))_{l \bmod N}$, where the subscript denotes that the $(l \bmod N)$th
coordinate of the block codeword is put out. The final sliding block code has a
finite length given by the maximum of the lengths of the code producing the
punctuation sequence and the code imbedding the block code $\gamma_N$ into the sliding
block code.
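The imbedding of a block code into a sliding block code can be illustrated on finite sequences. The sketch below is only an illustration under stated assumptions: the function name `sliding_block_encode` is hypothetical, the punctuation sequence `z` is taken as given rather than produced by Corollary 9.4.2, and blocks are assumed to be aligned with the position of the located 0 (the text's indexing conventions may differ by a shift).

```python
from typing import Callable, List, Sequence, Tuple


def sliding_block_encode(
    u: Sequence[int],
    z: Sequence[int],
    gamma_N: Callable[[Tuple[int, ...]], Sequence[int]],
    N: int,
    K: int,
    a_star: int,
) -> List[int]:
    """Encode u symbol by symbol, using z (values 0, 1, 2) for block alignment."""
    out: List[int] = []
    for n in range(len(u)):
        if z[n] == 2:
            out.append(a_star)  # idle punctuation symbol: arbitrary channel symbol
            continue
        # Look into the past (at most K*N symbols) for the most recent 0.
        start = None
        for back in range(K * N):
            t = n - back
            if t < 0 or z[t] == 2:
                break  # ran off the data or hit an idle symbol first
            if z[t] == 0:
                start = t
                break
        if start is None:
            out.append(a_star)
            continue
        # Block index J and coordinate r of time n within its block.
        J, r = divmod(n - start, N)
        block = tuple(u[start + J * N : start + (J + 1) * N])
        if len(block) < N:
            out.append(a_star)  # block truncated by the end of the data
            continue
        out.append(gamma_N(block)[r])
    return out
```

Each output symbol depends only on a bounded window of `u` and `z` around time `n`, and the same rule is applied at every time, which is what makes the resulting code a stationary (sliding block) code rather than a block code.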

We now proceed to compute the probability of the error event
$\{u, y : \hat U_0(y) \ne U_0(u)\} = E$. Let $E_u$ denote the section $\{y : \hat U_0(y) \ne U_0(u)\}$, $\bar f$ be the sequence
coder induced by $f$, and $F = \{u : Z_0(u) = 0\}$. Note that if $u \in T^{-1}F$,
then $Tu \in F$ and hence $Z_0(Tu) = Z_1(u)$ since the coding is stationary. More
generally, if $u \in T^{-i}F$, then $Z_i = 0$. By construction any 0 must be followed by
$KN$ 1's and hence the sets $T^{-i}F$ are disjoint for $i = 0, 1, \cdots, KN - 1$ and hence
we can write



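$$
\begin{aligned}
P_e &= \Pr(U_0 \ne \hat U_0) = \mu\nu(E) = \int d\mu(u)\, \nu_{\bar f(u)}(E_u) \\
&\le \sum_{i=0}^{LN-1} \int_{T^{-i}F} d\mu(u)\, \nu_{\bar f(u)}(E_u)
 + \sum_{i=LN}^{KN-1} \int_{T^{-i}F} d\mu(u)\, \nu_{\bar f(u)}(E_u)
 + \mu\Big(\Big(\bigcup_{i=0}^{KN-1} T^{-i}F\Big)^c\Big) \\
&\le LN\mu(F) + \sum_{i=LN}^{KN-1} \int_{T^{-i}F} d\mu(u)\, \nu_{\bar f(u)}(E_u) + \epsilon \\
&\le 2\epsilon + \sum_{i=LN}^{KN-1} \sum_{a^{KN} \in G^{KN}}
 \int_{u' \in T^{-i}(F \cap c(a^{KN}))} d\mu(u')\, \nu_{\bar f(u')}(y' : U_0(u') \ne \hat U_0(y')),
\end{aligned}
\tag{12.42}
$$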

where we have used the fact that $\mu(F) \le (KN)^{-1}$ (from Corollary 9.4.2) and
hence $LN\mu(F) \le L/K \le \epsilon$. Fix $i = kN + j$; $0 \le j \le N - 1$ and define
$u = T^{j+LN}u'$ and $y = T^{j+LN}y'$, and the integrals become



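$$
\begin{aligned}
&\int_{u' \in T^{-i}(F \cap c(a^{KN}))} d\mu(u')\, \nu_{\bar f(u')}(y' : U_0(u') \ne g_m(Y^m_{-NL}(y'))) \\
&\quad = \int_{u \in T^{-(k-L)N}(F \cap c(a^{KN}))} d\mu(u)\, \nu_{\bar f(T^{-(j+LN)}u)}(y : U_0(T^{j+LN}u) \ne g_m(Y^m_{-NL}(T^{j+NL}y))) \\
&\quad = \int_{u \in T^{-(k-L)N}(F \cap c(a^{KN}))} d\mu(u)\, \nu_{\bar f(T^{-(j+LN)}u)}(y : u_{j+LN} \ne g_m(y^m_j)) \\
&\quad = \int_{u \in T^{-(k-L)N}(F \cap c(a^{KN}))} d\mu(u)\, \nu_{\bar f(T^{-(j+LN)}u)}(y : u^N_{LN} \ne \psi_N(y^N_{LN}) \text{ or } s(y^{LN}_j) \ne j).
\end{aligned}
\tag{12.43}
$$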

If $u^N_{LN} = \beta_j \in G_N$, then $u^N_{LN} = \psi_N(y^N_{LN})$ if $y^N_{LN} \in S \times W_j$. If
$u \in T^{-(k-L)N}c(a^{KN})$, then $u^m = a^m_{(k-L)N}$ and hence from Lemma 12.9.1 and
stationarity we have for $i = kN + j$ that



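$$
\begin{aligned}
\sum_{a^{KN} \in G^{KN}} \int_{T^{-i}(c(a^{KN}) \cap F)} d\mu(u)\, \nu_{\bar f(u)}(E_u)
&\le 3\epsilon \sum_{\substack{a^{KN} \in G^{KN}: \\ a^m_{(k-L)N} \in \Phi \cap (G^{LN} \times G_N)}} \mu\big(T^{-(k-L)N}(c(a^{KN}) \cap F)\big) \\
&\quad + \sum_{\substack{a^{KN} \in G^{KN}: \\ a^m_{(k-L)N} \notin \Phi \cap (G^{LN} \times G_N)}} \mu\big(T^{-(k-L)N}(c(a^{KN}) \cap F)\big) \\
&\le 3\epsilon \sum_{a^{KN} \in G^{KN}} \mu(c(a^{KN}) \cap F)
 + \sum_{\substack{a^{KN} \in G^{KN}: \\ a^m_{(k-L)N} \notin \Phi \cap (G^{LN} \times G_N)}} \mu(c(a^{KN}) \cap F) \\
&\le 3\epsilon\mu(F) + \mu(c(\Phi^c) \cap F) + \mu(c(G_N) \cap F).
\end{aligned}
\tag{12.44}
$$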

Choose the partition in Lemmas 9.5.1-9.5.2 to be that generated by the sets
$c(\Phi^c)$ and $c(G_N)$ (the partition with all four possible intersections of these sets
or their complements). Then the above expression is bounded above by

$$\frac{3\epsilon}{NK} + \frac{\epsilon}{NK} + \frac{\epsilon}{NK} \le 5\frac{\epsilon}{NK}$$

and hence from (12.42)

$$P_e \le 5\epsilon \le \delta \tag{12.45}$$

which completes the proof. □ 

The lemma immediately yields the following corollary. 

Corollary 12.9.1: If $\nu$ is a stationary $\bar d$-continuous totally ergodic channel
with Shannon capacity $C$, then any totally ergodic source $[G, \mu, U]$ with $\bar H(\mu) < C$
is admissible.



Ergodic Sources 

If a prefixed blocklength $N$ block code of Corollary 12.9.1 is used to block encode
a general ergodic source $[G, \mu, U]$, then successive $N$-tuples from $\mu$ may not be
ergodic, and hence the previous analysis does not apply. From the Nedoma
ergodic decomposition [106] (see, e.g., [50], p. 232), any ergodic source $\mu$ can be
represented as a mixture of $N$-ergodic sources, all of which are shifted versions
of each other. Given an ergodic measure $\mu$ and an integer $N$, there exists
a decomposition of $\mu$ into $M$ $N$-ergodic, $N$-stationary components where $M$
divides $N$, that is, there is a set $\Pi \in \mathcal{B}_G^{\infty}$ such that

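$$T^M\Pi = \Pi \tag{12.46}$$

$$\mu(T^i\Pi \cap T^j\Pi) = 0; \quad i, j < M,\ i \ne j \tag{12.47}$$

$$\mu\Big(\bigcup_{i=0}^{M-1} T^i\Pi\Big) = 1$$

$$\mu(\Pi) = \frac{1}{M},$$

such that the sources $[G, \pi_i, U]$, where $\pi_i(W) = \mu(W \mid T^i\Pi) = M\mu(W \cap T^i\Pi)$,
are $N$-ergodic and $N$-stationary and

$$\mu(W) = \frac{1}{M}\sum_{i=0}^{M-1} \pi_i(W) = \sum_{i=0}^{M-1} \mu(W \cap T^i\Pi). \tag{12.48}$$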
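A toy case may help fix ideas about the decomposition. The sketch below is an illustration only and is not taken from the text: it assumes a deterministic period-2 source with a uniformly random phase, which is stationary and ergodic but not 2-ergodic, so that $M = N = 2$. The helper names `periodic_source`, `block_frequencies`, and `mixture_of_modes` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List, Tuple

Block = Tuple[int, ...]


def periodic_source(phase: int, length: int, period: int = 2) -> List[int]:
    """One sample path of the periodic source, started at the given phase."""
    return [(t + phase) % period for t in range(length)]


def block_frequencies(seq: List[int], N: int) -> Dict[Block, Fraction]:
    """Relative frequencies of successive non-overlapping N-tuples."""
    blocks = [tuple(seq[t:t + N]) for t in range(0, len(seq) - N + 1, N)]
    counts = Counter(blocks)
    return {b: Fraction(c, len(blocks)) for b, c in counts.items()}


def mixture_of_modes(modes: List[Dict[Block, Fraction]]) -> Dict[Block, Fraction]:
    """Equal-weight mixture (1/M) * sum_i pi_i of the block distributions."""
    M = len(modes)
    mix: Dict[Block, Fraction] = {}
    for pi in modes:
        for b, p in pi.items():
            mix[b] = mix.get(b, Fraction(0)) + p / M
    return mix


# Each phase gives one N-ergodic mode; their average is the N-block
# distribution of the stationary source, in the spirit of (12.48).
modes = [block_frequencies(periodic_source(i, 1000), 2) for i in range(2)]
mu_blocks = mixture_of_modes(modes)
```

In this toy case each mode puts all of its probability on a single 2-tuple, $(0,1)$ or $(1,0)$, while the stationary mixture assigns probability $1/2$ to each. A block code of length 2 therefore sees different block statistics depending on the mode in effect, which is why the encoder must align itself with a chosen mode.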



This decomposition provides a method of generalizing the results for totally
ergodic sources to ergodic sources. Since $\mu(\cdot|\Pi)$ is $N$-ergodic, Lemma 12.9.2 is
valid if $\mu$ is replaced by $\mu(\cdot|\Pi)$. If an infinite length sliding block encoder $f$ is
used, it can determine the ergodic component in effect by testing for $T^{-i}\Pi$ in
the base of the tower, insert $i$ dummy symbols, and then encode using the
length $N$ prefixed block code. In other words, the encoder can line up the block
code with a prespecified one of the $N$ possible $N$-ergodic modes. A finite length
encoder can then be obtained by approximating the infinite length encoder by
a finite length encoder. Making these ideas precise yields the following result.

Theorem 12.9.1: If $\nu$ is a stationary $\bar d$-continuous totally ergodic channel
with Shannon capacity $C$, then any ergodic source $[G, \mu, U]$ with $\bar H(\mu) < C$ is
admissible.

Proof: Assume that $N$ is large enough for Corollary 12.8.1 and (12.38)-
(12.40) to hold. From the Nedoma decomposition

$$\frac{1}{M}\sum_{i=0}^{M-1} \mu^N(G_N \mid T^i\Pi) = \mu^N(G_N) \ge 1 - \epsilon,$$

and hence there exists at least one $i$ for which

$$\mu^N(G_N \mid T^i\Pi) \ge 1 - \epsilon;$$

that is, at least one $N$-ergodic mode must put high probability on the set $G_N$
of typical $N$-tuples for $\mu$. For convenience relabel the indices so that this good
mode is $\mu(\cdot|\Pi)$ and call it the design mode. Since $\mu(\cdot|\Pi)$ is $N$-ergodic and
$N$-stationary, Lemma 12.9.1 holds with $\mu$ replaced by $\mu(\cdot|\Pi)$; that is, there is a
source/channel block code $(\gamma_N, \psi_N)$ and a sync locating function $s : B^{LN} \to
\{0, 1, \cdots, M - 1\}$ such that there is a set $\Phi \in G^m$; $m = (L + 1)N$, for which
(12.31) holds and

$$\mu^m(\Phi \mid \Pi) \ge 1 - \epsilon.$$

The sliding block decoder is constructed exactly as in Lemma 12.9.1. The sliding
block encoder, however, is somewhat different. Consider a punctuation sequence
or tower as in Lemma 9.5.2, but now consider the partition generated by $\Phi$, $G_N$,
and $T^i\Pi$, $i = 0, 1, \cdots, M - 1$. The infinite length sliding block code is defined
as follows: If $u \notin \bigcup_{k=0}^{KN-1} T^kF$, then $f(u) = a^*$, an arbitrary channel symbol. If
$u \in T^i(F \cap T^{-j}\Pi)$ and if $i < j$, set $f(u) = a^*$ (these are spacing symbols to force
alignment with the proper $N$-ergodic mode). If $j \le i \le KN - (M - j)$, then
$i = j + kN + r$ for some $0 \le k \le (K - 1)N$, $r \le N - 1$. Form $\gamma_N(u^N_{j+kN}) = a^N$
and set $f(u) = a_r$. This is the same encoder as before, except that if $u \in T^j\Pi$,
then block encoding is postponed for $j$ symbols (at which time $u \in \Pi$). Lastly,
if $KN - (M - j) \le i \le KN - 1$, then $f(u) = a^*$.

As in the proof of Lemma 12.9.2 

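$$
\begin{aligned}
P_e(\mu, \nu, f, g_m) &= \int d\mu(u)\, \nu_{\bar f(u)}(y : U_0(u) \ne g_m(Y^m_{-LN}(y))) \\
&\le 2\epsilon + \sum_{i=LN}^{KN-1} \int_{u \in T^iF} d\mu(u)\, \nu_{\bar f(u)}(y : U_0(u) \ne \hat U_0(y)) \\
&= 2\epsilon + \sum_{i=LN}^{KN-1} \sum_{j=0}^{M-1} \sum_{a^{KN} \in G^{KN}}
 \int_{u \in T^i(c(a^{KN}) \cap F \cap T^{-j}\Pi)} d\mu(u)\, \nu_{\bar f(u)}(y : U_0(u) \ne \hat U_0(y)) \\
&\le 2\epsilon + \sum_{j=0}^{M-1} \sum_{i=LN+j}^{KN-(M-j)} \sum_{a^{KN} \in G^{KN}}
 \int_{u \in T^i(c(a^{KN}) \cap F \cap T^{-j}\Pi)} d\mu(u)\, \nu_{\bar f(u)}(y : U_0(u) \ne \hat U_0(y)) \\
&\quad + \sum_{j=0}^{M-1} M\mu(F \cap T^{-j}\Pi),
\end{aligned}
$$

where the rightmost term is

$$M\sum_{j=0}^{M-1} \mu(F \cap T^{-j}\Pi) \le \frac{M}{KN} \le \frac{1}{K} \le \epsilon.$$

Thus

$$
P_e(\mu, \nu, f, g_m) \le 3\epsilon + \sum_{j=0}^{M-1} \sum_{i=LN+j}^{KN-(M-j)} \sum_{a^{KN} \in G^{KN}}
 \int_{u \in T^i(c(a^{KN}) \cap F \cap T^{-j}\Pi)} d\mu(u)\, \nu_{\bar f(u)}(y : U_0(u) \ne \hat U_0(y)).
\tag{12.49}
$$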

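Analogous to (12.43) (except that here $i = j + kN + r$, $u = T^{-(LN+r)}u'$)

$$
\begin{aligned}
&\int_{u' \in T^i(c(a^{KN}) \cap F \cap T^{-j}\Pi)} d\mu(u')\, \nu_{\bar f(u')}(y' : U_0(u') \ne g_m(Y^m_{-LN}(y'))) \\
&\quad \le \int_{u \in T^{j+(k-L)N}(c(a^{KN}) \cap F \cap T^{-j}\Pi)} d\mu(u)\,
 \nu_{\bar f(T^{r+LN}u)}(y : u^N_{LN} \ne \psi_N(y^N_{LN}) \text{ or } s(y^{LN}_r) \ne r).
\end{aligned}
$$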

Thus since $u \in T^{j+(k-L)N}(c(a^{KN}) \cap F \cap T^{-j}\Pi)$ implies $u^m = a^m_{j+(k-L)N}$,
analogous to (12.44) we have that for $i = j + kN + r$

$$
\begin{aligned}
&\sum_{a^{KN} \in G^{KN}} \int_{T^i(c(a^{KN}) \cap F \cap T^{-j}\Pi)} d\mu(u)\, \nu_{\bar f(u)}(y : U_0(u) \ne g_m(Y^m_{-LN}(y))) \\
&\quad \le \epsilon \sum_{a^{KN} :\, a^m_{j+(k-L)N} \in \Phi} \mu(c(a^{KN}) \cap F \cap T^{-j}\Pi)
 + \sum_{a^{KN} :\, a^m_{j+(k-L)N} \notin \Phi} \mu(c(a^{KN}) \cap F \cap T^{-j}\Pi) \\
&\quad = \epsilon\mu\big(T^{-(j+(k-L)N)}c(\Phi) \cap F \cap T^{-j}\Pi\big)
 + \mu\big(T^{-(j+(k-L)N)}c(\Phi)^c \cap F \cap T^{-j}\Pi\big).
\end{aligned}
$$

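From Lemma 9.5.2 (the Rohlin-Kakutani theorem), this is bounded above
by

$$
\begin{aligned}
&\epsilon\frac{\mu\big(T^{-(j+(k-L)N)}c(\Phi) \cap T^{-j}\Pi\big)}{KN}
 + \frac{\mu\big(T^{-(j+(k-L)N)}c(\Phi)^c \cap T^{-j}\Pi\big)}{KN} \\
&\quad = \epsilon\frac{\mu\big(T^{-(j+(k-L)N)}c(\Phi) \mid T^{-j}\Pi\big)\mu(\Pi)}{KN}
 + \frac{\mu\big(T^{-(j+(k-L)N)}c(\Phi)^c \mid T^{-j}\Pi\big)\mu(\Pi)}{KN} \\
&\quad = \epsilon\mu(c(\Phi) \mid \Pi)\frac{\mu(\Pi)}{KN} + \mu(c(\Phi)^c \mid \Pi)\frac{\mu(\Pi)}{KN}
 \le \frac{2\epsilon}{MKN}.
\end{aligned}
$$

With (12.48)-(12.49) this yields

$$P_e(\mu, \nu, f, g_m) \le 3\epsilon + MKN\frac{2\epsilon}{MKN} \le 5\epsilon, \tag{12.50}$$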



which completes the result for an infinite sliding block code. 

The proof is completed by applying Corollary 10.5.1, which shows that by
choosing a finite length sliding block code $f_0$ from Lemma 4.2.4 so that
$\Pr(f \ne f_0)$ is sufficiently small, the resulting $P_e$ is close to that for the infinite
length sliding block code. □

In closing we note that the theorem can be combined with the sliding block 
source coding theorem to prove a joint source and channel coding theorem
similar to Theorem 12.7.1, that is, one can show that given a source with distortion
rate function D(R) and a channel with capacity C, then sliding block codes 
exist with average distortion approximately D(C). 